Opt into purging by destroy for container entities, use delete elsewhere #23389

Merged
merged 8 commits into from
Mar 25, 2025

Conversation

jrafanie
Member

@jrafanie jrafanie commented Mar 20, 2025

Fixes most of the issues in #23307

We were leaving around lots of orphaned container* rows when we removed the container entities. This change lets us opt into using destroy on the primary table in situations where we know the associated records are NOT going to number many tens of thousands of rows. If they are, those associations should NOT use dependent :destroy and should have their own purger.
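As a rough illustration of why delete leaves orphans while destroy does not, here is a minimal in-memory sketch. It does not use ActiveRecord; `TinyStore`, `Row`, and the method names are hypothetical stand-ins for a parent table (containers) and a child table (env vars).

```ruby
# In-memory stand-in for two tables; not ActiveRecord.
Row = Struct.new(:id, :parent_id)

class TinyStore
  attr_reader :containers, :env_vars

  def initialize
    @containers = [Row.new(1, nil), Row.new(2, nil)]
    @env_vars   = [Row.new(10, 1), Row.new(11, 1), Row.new(12, 2)]
  end

  # Like delete: removes only the parent row, nothing else is touched.
  def delete_container(id)
    @containers.reject! { |c| c.id == id }
  end

  # Like destroy with a dependent destroy association: children go first, then the parent.
  def destroy_container(id)
    @env_vars.reject! { |e| e.parent_id == id }
    delete_container(id)
  end

  # Child rows whose parent row no longer exists.
  def orphaned_env_vars
    parent_ids = @containers.map(&:id)
    @env_vars.reject { |e| parent_ids.include?(e.parent_id) }
  end
end

store = TinyStore.new
store.delete_container(1)
puts store.orphaned_env_vars.map(&:id).inspect # env vars 10 and 11 are now orphaned

store.destroy_container(2)
puts store.env_vars.map(&:id).inspect          # env var 12 went away with its parent
```

The trade-off named in the description still applies: the destroy path touches every child row, so it only makes sense where the child counts stay small.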

TODO:

Note: I've moved additional tasks for more non-container purging to #23394

  • Verify and test dependent => destroy on associations from (except metrics, metric_rollups, vim_performance_states handled by separate purger):
    • models
      • container
      • container_group
      • container_image
      • container_node
      • container_project
      • container_quota
      • container_quota_item
    • associations (known to be previously orphaned)
      • container_condition (many millions)
        • container_group
        • container_node
      • container_env_var (millions)
        • container
      • container_volume (millions)
        • container_group
        • persistent_volume_claim (NO change, claims can be used by other volumes)
      • security_context (millions)
        • container
      • container_port_config (million)
        • container
      • custom_attribute (many millions)
        • via CustomAttributeMixin
      • guest_applications (million)
        • container_image
  • Other associations
    • containers
      • container_image (NO change, handled by purger)
    • container_builds
      • container_project (NO change, handled by refresh)
    • container_build_pods
      • container_build (NO change, handled by refresh)
    • container_group
      • container_build_pod (NO change, handled by purger)
      • container_replicator (NO change, handled by purger)
    • container_groups -> {active} ?
      • container_node (NO change, handled by purger)
      • container_project (NO change, handled by purger)
    • container_images (nullify)
      • container_image_registry
    • container_limits
      • container_project (NO change, handled by refresh)
    • container_quotas
      • container_project (NO change, handled by purger)
    • container_replicators
      • container_project (NO change, handled by refresh)
    • container_routes
      • container_project (NO change, handled by refresh)
      • container_service (NO change, handled by refresh)
    • container_services
      • container_project (NO change, handled by refresh)
      • container_image_registry (NO change, handled by refresh)
    • container_templates
      • container_project (NO change, handled by refresh)
    • container_volumes
      • persistent_volume_claim (NO change, this is fine, container volume will remove pvc when it's removed)
    • custom_attributes (no purger for model)
      • container_build (destroyed from project purger now via destroy)
      • container_build_pod (destroyed from project -> container_build project purger now via destroy)
      • container_replicator (destroyed from project purger now via destroy)
      • container_route (destroyed from project purger now via destroy)
      • container_service (destroyed from project purger now via destroy)
      • container_template (destroyed from project purger now via destroy)
    • metric_rollups, metrics, vim_performance_states (nullify)
      • container_image
    • miq_alert_statuses
      • container_node
    • persistent_volume_claims
      • container_project (NO change, this is fine, handled by container volume when it's removed)

@Fryguy
Member

Fryguy commented Mar 20, 2025

@jrafanie Just for clarification, this changes everything to use destroy as part of the purger, but what schedules the purging of those? Or is that still TODO?

Is the plan to let things orphan out the way they do now and then change that in a followup? Or is the plan to switch to destroy and also change the models to do dependent => destroy?

@jrafanie
Member Author

@jrafanie Just for clarification, this changes everything to use destroy as part of the purger, but what schedules the purging of those? Or is that still TODO?

Just the container entities handled by the purger were changed to destroy. Everything else is still delete. The schedules remain the same:

      :container_entities_purge_interval: 1.day

Under the covers, the scheduler calls the same methods at the same interval. When they run in batches, they now use destroy.
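The opt-in can be pictured roughly like this. This is a simplified sketch, not the actual PurgingMixin: `TinyPurging`, `FakeRecord`, `FakeEvent`, and `FakeContainer` are hypothetical, and only the `purge_method` name and its `:delete` / `:destroy` values come from the PR.

```ruby
# Hypothetical, simplified mixin; the real PurgingMixin is not reproduced here.
module TinyPurging
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # Default stays :delete; a class opts into :destroy by overriding this.
    def purge_method
      :delete
    end

    # Removes records batch by batch and returns how many were processed.
    def purge_in_batches(records, batch_size)
      total = 0
      records.each_slice(batch_size) do |batch|
        batch.each { |record| record.public_send(purge_method) }
        total += batch.size
      end
      total
    end
  end
end

# Records that simply remember which removal method was called on them.
class FakeRecord
  attr_reader :removed_by

  def delete
    @removed_by = :delete
  end

  def destroy
    @removed_by = :destroy
  end
end

class FakeEvent < FakeRecord
  include TinyPurging
end

class FakeContainer < FakeRecord
  include TinyPurging

  def self.purge_method
    :destroy
  end
end

events     = Array.new(3) { FakeEvent.new }
containers = Array.new(5) { FakeContainer.new }
FakeEvent.purge_in_batches(events, 2)
FakeContainer.purge_in_batches(containers, 2)
puts events.map(&:removed_by).uniq.inspect     # [:delete]
puts containers.map(&:removed_by).uniq.inspect # [:destroy]
```

The caller (here, whatever stands in for the scheduler) does not change; only the per-class `purge_method` decides which removal path a batch takes.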

Is the plan to let things orphan out the way they do now and then change that in a followup? Or is the plan to switch to destroy and also change the models to do dependent => destroy?

Most of the container* associations for these container entities that were missed, as highlighted in #23307 (container conditions, volumes, etc.), should already be dependent destroy from the entities (containers, groups, nodes, projects, etc.).

@jrafanie
Member Author

Is the plan to let things orphan out the way they do now and then change that in a followup? Or is the plan to switch to destroy and also change the models to do dependent => destroy?

Most of the container* associations for these container entities that were missed, as highlighted in #23307 (container conditions, volumes, etc.), should already be dependent destroy from the entities (containers, groups, nodes, projects, etc.).

FYI, I added a bullet list to the PR description of what container entities and associations I'm verifying.

@jrafanie jrafanie force-pushed the purge-revamp branch 2 times, most recently from 0cd93cb to 0311e21 Compare March 20, 2025 22:43
@Fryguy
Member

Fryguy commented Mar 21, 2025

I'd like @agrare to also review this so it plays nice with the refresh workers possibly deleting entities.

@agrare
Member

agrare commented Mar 21, 2025

so it plays nice with the refresh workers possibly deleting entities.

Anything being considered for purging should already have been disconnected/archived by the RefreshWorker so there should be no contention here.

      def purge_scope(older_than)
        where(arel_table[:deleted_on].lteq(older_than))
      end 
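
To make the scope's effect concrete, here is an in-memory sketch of the same filter. `Entity` and `purge_candidates` are hypothetical names; the assumption, taken from this thread, is that refresh sets `deleted_on` when it archives a record and leaves it nil for active ones.

```ruby
Entity = Struct.new(:id, :deleted_on)

# In-memory equivalent of where(deleted_on <= older_than).
# Active rows (deleted_on is nil) never match, as with a SQL comparison against NULL.
def purge_candidates(entities, older_than)
  entities.select { |e| e.deleted_on && e.deleted_on <= older_than }
end

cutoff = Time.utc(2025, 3, 23)
entities = [
  Entity.new(1, nil),                   # still active
  Entity.new(2, Time.utc(2025, 3, 20)), # archived before the cutoff
  Entity.new(3, Time.utc(2025, 3, 24))  # archived after the cutoff
]
puts purge_candidates(entities, cutoff).map(&:id).inspect # [2]
```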

@@ -38,9 +38,9 @@ class ContainerImage < ApplicationRecord
:inverse_of => :resource
has_one :last_scan_result, :class_name => "ScanResult", :as => :resource, :dependent => :destroy, :autosave => true

has_many :metric_rollups, :as => :resource, :dependent => :nullify, :inverse_of => :resource
has_many :metrics, :as => :resource, :dependent => :nullify, :inverse_of => :resource
has_many :vim_performance_states, :as => :resource, :dependent => :nullify, :inverse_of => :resource
Member Author

Not sure why these were nullify

Member

Yeah weird - in other models we just let those be orphaned, right?

Member Author

Correct, this is the only one that does nullify on metrics* and vim_perf*

@@ -2,13 +2,13 @@ class ContainerImageRegistry < ApplicationRecord
belongs_to :ext_management_system, :foreign_key => "ems_id"

# Associated with images in the registry.
has_many :container_images, :dependent => :nullify
Member Author

@jrafanie jrafanie Mar 21, 2025

This should be handled by the purger as long as refresh marks them as archived via deleted_on.

Member

ContainerImages are also related to Containers which could still be around after a container_image_registry is deleted. I think dependent nullify is appropriate here
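
A small in-memory sketch of what nullify means here, assuming hypothetical `Image` rows with a `registry_id` foreign key: the image rows survive the registry's removal and only lose their reference to it.

```ruby
Image = Struct.new(:id, :registry_id)

# Like a dependent nullify association: the image rows stay, only the foreign key is cleared.
def remove_registry_with_nullify(images, registry_id)
  images.each { |img| img.registry_id = nil if img.registry_id == registry_id }
end

images = [Image.new(1, 7), Image.new(2, 7), Image.new(3, 8)]
remove_registry_with_nullify(images, 7)
puts images.size                       # 3, no image rows were removed
puts images.map(&:registry_id).inspect # [nil, nil, 8]
```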

@@ -44,7 +44,7 @@ class ContainerNode < ApplicationRecord
has_many :metrics, :as => :resource
has_many :metric_rollups, :as => :resource
has_many :vim_performance_states, :as => :resource
-  has_many :miq_alert_statuses, :as => :resource
+  has_many :miq_alert_statuses, :as => :resource, :dependent => :destroy
Member Author

not sure why these 2 were not being destroyed ☝️

Member Author

Seeing thousands of these likely because nodes don't have the same fast lifecycle as containers/pods... should be handled by existing node purgers once this is in.

Member

👍 this looks like a good fix

Member Author

TODO: consider a data migration to clean up rows like these that were already orphaned when their parent records were removed.

Member Author

Added a bullet item to do a data migration in the high level issue: #23394
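
As a sketch of what such an orphan cleanup would need to find, here is an in-memory check for polymorphic rows whose parent is gone. The row data and `orphaned_rows` are hypothetical; the `resource_type` / `resource_id` column names follow the Rails convention for an `:as => :resource` association. Types that are not listed in the lookup are deliberately left alone.

```ruby
# Returns rows whose parent type is known but whose parent id no longer exists.
def orphaned_rows(rows, existing_ids)
  rows.select do |row|
    ids = existing_ids[row[:resource_type]]
    ids && !ids.include?(row[:resource_id])
  end
end

statuses = [
  { id: 1, resource_type: "ContainerNode", resource_id: 100 }, # parent still exists
  { id: 2, resource_type: "ContainerNode", resource_id: 101 }, # parent was removed
  { id: 3, resource_type: "Vm",            resource_id: 5 }    # type not checked here
]
existing_ids = { "ContainerNode" => [100] }

puts orphaned_rows(statuses, existing_ids).map { |r| r[:id] }.inspect # [2]
```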

has_many :all_container_groups, :class_name => "ContainerGroup", :inverse_of => :container_project
has_many :archived_container_groups, -> { archived }, :class_name => "ContainerGroup"
-  has_many :persistent_volume_claims
+  has_many :persistent_volume_claims, :dependent => :destroy
Member Author

WAT, yes, we probably don't have that many container projects but still, I think these should all be destroyed... otherwise the purger for projects will just leave these orphaned.

@agrare do any of these have a chance to be impossible to delete in the UI/backend due to many tens of thousands of rows? Maybe builds? I'm not sure.

Member

persistent_volume_claims are associated to a container_volume which does the dependent destroy:

app/models/container_volume.rb:  belongs_to :persistent_volume_claim, :dependent => :destroy

Member Author

Thanks!

@@ -6,7 +6,7 @@ class ContainerService < ApplicationRecord

belongs_to :ext_management_system, :foreign_key => "ems_id"
has_and_belongs_to_many :container_groups, :join_table => :container_groups_container_services
-  has_many :container_routes
+  has_many :container_routes, :dependent => :destroy
Member Author

Is this one correct @agrare ? projects has many routes and services has many routes? Should projects have routes through services? Or maybe there can be routes not attached to a service? 🤷

Member

Are we leaving routes orphaned @jrafanie ?

A container route is its own "top-level" managed object so we would destroy it when we get the destroy event from k8s (versus e.g. a container which is only part of a container_group and wouldn't get its own destroy event) I'm surprised it isn't at least dependent nullify though.

Member Author

No. I'm seeing less than 200 routes where container conditions, env vars, volumes, security contexts, port configs, and custom attributes are the big ones in the container area over 1 million rows.

I can change it to nullify. Are there others here that should be treated in the same way? I don't want to change behavior. I'll verify with my table counts from various databases.

Member Author

Kept it the same and added a comment

@@ -1,7 +1,7 @@
class PersistentVolumeClaim < ApplicationRecord
belongs_to :ext_management_system, :foreign_key => "ems_id"
belongs_to :container_project
-  has_many :container_volumes
+  has_many :container_volumes, :dependent => :destroy
Member Author

not sure why this wasn't being destroyed... too many maybe? We don't have a separate purger for volumes though.

Member

A persistent volume claim will point to a persistent volume when the claim is satisfied, but it doesn't own the volume. The volume could be reused by a future claim, so deleting the claim leaves the volume around.

Member

Or more accurately, if the PV's reclaim policy is "Retain", then deleting the PVC won't also delete the PV.

Kubernetes is hard this way, because technically all these objects are loosely bound to each other.

Member Author

Will add a comment. That does make sense when thinking about a claim.

@@ -9,7 +9,7 @@ class ContainerBuild < ApplicationRecord
:as => :resource,
:inverse_of => :resource

-  has_many :container_build_pods
+  has_many :container_build_pods, :dependent => :destroy
Member Author

seeing only 10s of these, perhaps they're managed elsewhere, such as delete events.

Member

container_build_pods are a top level managed object which should be deleted during refresh

has_many :containers, :through => :container_images
has_many :container_groups, :through => :container_images

# Associated with serving the registry itself - for openshift's internal
# image registry. These will be empty for external registries.
-  has_many :container_services
+  has_many :container_services, :dependent => :destroy
Member Author

seeing tens of registries to many hundreds of services. Should be handled if registries are removed with the container manager.

Member

ContainerServices should be deleted by refresh when they are removed

@Fryguy
Member

Fryguy commented Mar 21, 2025

I'm starting to wonder if this is a refresh problem. I thought just about everything in Kubernetes was a "top-level" object, since you can create objects willy-nilly that do anything, and there are loose associations between many things by using labels and selectors. There aren't really ownership references between them. Are we just missing events during refresh for destroying these orphaned things, or perhaps they just aren't in the events/watches?

-  has_many :container_routes
-  has_many :container_replicators
-  has_many :container_services
+  has_many :container_routes, :dependent => :destroy
Member Author

Many 10s of these. Maybe handled by event handling or just lower lifecycle churn.

Member

container_routes are a top-level managed entity that should be deleted by refresh when they are removed from k8s

-  has_many :container_replicators
-  has_many :container_services
+  has_many :container_routes, :dependent => :destroy
+  has_many :container_replicators, :dependent => :destroy
Member Author

1 of these

Member

Same here

-  has_many :container_services
+  has_many :container_routes, :dependent => :destroy
+  has_many :container_replicators, :dependent => :destroy
+  has_many :container_services, :dependent => :destroy
Member Author

See ☝️ registries: https://github.com/ManageIQ/manageiq/pull/23389/files#r2007811459

There are hundreds of these across hundreds of projects. Low churn or handled elsewhere?

Member

Same here

has_many :containers, :through => :container_groups
has_many :container_images, -> { distinct }, :through => :container_groups
has_many :container_nodes, -> { distinct }, :through => :container_groups
has_many :container_quotas, -> { active }, :inverse_of => :container_project
has_many :container_quota_scopes, :through => :container_quotas
has_many :container_quota_items, :through => :container_quotas
-  has_many :container_limits
+  has_many :container_limits, :dependent => :destroy
Member Author

0 from example data

Member

These are a top-level object which should be deleted by the refresher

@Fryguy
Member

Fryguy commented Mar 21, 2025

Oh, the k8s uid is a really good indicator - I assume they'd always be the uid_ems column, so @jrafanie if you see that you can figure out which category they belong in.

@agrare
Member

agrare commented Mar 21, 2025

Oh, the k8s uid is a really good indicator - I assume they'd always be the uid_ems column

ems_ref column but yes that is probably 90-95% definitive (container_images come to mind as the exception that proves the rule)

@@ -6,7 +6,7 @@ class ContainerService < ApplicationRecord

belongs_to :ext_management_system, :foreign_key => "ems_id"
has_and_belongs_to_many :container_groups, :join_table => :container_groups_container_services
-  has_many :container_routes
+  has_many :container_routes, :dependent => :destroy # TODO: consider nullify for things like routes, which should be removed by event handling
Member

Routes are a top-level openshift collection that should be deleted by the refresher

@jrafanie jrafanie force-pushed the purge-revamp branch 2 times, most recently from bb2d6d3 to 473e094 Compare March 24, 2025 19:38
destroyed = batch_records.destroy_all
destroyed.detect { |d| !d.destroyed? }.tap do |failed|
raise "failed removing record: #{failed.class.name} with id: #{failed.id} with error: #{failed.errors.full_messages}" if failed
end
Member Author

@jrafanie jrafanie Mar 24, 2025

To test this, I changed the port configs association so that deleting a container fails while its port configs still exist. It's a silly failure, but it gives an idea of how it looks in the logs:

index f2093b1fa4..63352d093c 100644
--- a/app/models/container.rb
+++ b/app/models/container.rb
@@ -10,7 +10,7 @@ class Container < ApplicationRecord
   has_one    :container_replicator, :through => :container_group
   has_one    :container_project, :through => :container_group
   belongs_to :container_image
-  has_many   :container_port_configs, :dependent => :destroy
+  has_many   :container_port_configs, :dependent => :restrict_with_error
   has_many   :container_env_vars, :dependent => :destroy
   has_one    :container_image_registry, :through => :container_image
   has_one    :security_context, :as => :resource, :dependent => :destroy
diff --git a/config/settings.yml b/config/settings.yml
[----] I, [2025-03-24T18:05:13.358085#93726:88cc]  INFO -- evm: MIQ(Container.purge_by_date) Purging containers older than [2025-03-23 22:05:13 UTC]...
[----] I, [2025-03-24T18:05:13.358273#93726:88cc]  INFO -- evm: MIQ(Container.purge_in_batches) Purging 1000 containers.
[----] E, [2025-03-24T18:05:15.513443#93726:88cc] ERROR -- evm: MIQ(MiqQueue#deliver) Message id: [25979608], Error: [failed removing record: ManageIQ::Providers::Openshift::ContainerManager::Container with id: 1835 with error: ["Cannot delete record because dependent manageiq::providers::openshift::containermanager::container: container port configs exist"]]
[----] E, [2025-03-24T18:05:15.513538#93726:88cc] ERROR -- evm: [RuntimeError]: failed removing record: ManageIQ::Providers::Openshift::ContainerManager::Container with id: 1835 with error: ["Cannot delete record because dependent manageiq::providers::openshift::containermanager::container: container port configs exist"]  Method:[block (2 levels) in <class:LogProxy>]
[----] E, [2025-03-24T18:05:15.513581#93726:88cc] ERROR -- evm: /Users/joerafaniello/Code/manageiq/app/models/mixins/purging_mixin.rb:235:in `block (2 levels) in purge_in_batches'
<internal:kernel>:90:in `tap'
/Users/joerafaniello/Code/manageiq/app/models/mixins/purging_mixin.rb:234:in `block in purge_in_batches'
<internal:kernel>:187:in `loop'
/Users/joerafaniello/Code/manageiq/app/models/mixins/purging_mixin.rb:208:in `purge_in_batches'
/Users/joerafaniello/Code/manageiq/app/models/mixins/purging_mixin.rb:93:in `purge_by_date'
/Users/joerafaniello/Code/manageiq/app/models/miq_queue.rb:517:in `block in dispatch_method'
/Users/joerafaniello/.gem/ruby/3.3.6/gems/timeout-0.4.3/lib/timeout.rb:185:in `block in timeout'
/Users/joerafaniello/.gem/ruby/3.3.6/gems/timeout-0.4.3/lib/timeout.rb:38:in `handle_timeout'
/Users/joerafaniello/.gem/ruby/3.3.6/gems/timeout-0.4.3/lib/timeout.rb:194:in `timeout'
/Users/joerafaniello/Code/manageiq/app/models/miq_queue.rb:515:in `dispatch_method'
/Users/joerafaniello/Code/manageiq/app/models/miq_queue.rb:484:in `block in deliver'
/Users/joerafaniello/Code/manageiq/app/models/user.rb:390:in `with_user_group'
/Users/joerafaniello/Code/manageiq/app/models/miq_queue.rb:484:in `deliver'
/Users/joerafaniello/Code/manageiq/app/models/miq_queue.rb:508:in `deliver_and_process'
/Users/joerafaniello/Code/manageiq/lib/vmdb/console_methods.rb:67:in `block in simulate_queue_worker'
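
The fail-fast check exercised above can be sketched without a database. `FakeRow` and `destroy_batch!` are hypothetical; the `blocked` flag stands in for an association that refuses the destroy, in the way `restrict_with_error` does in the test diff.

```ruby
class FakeRow
  attr_reader :id, :errors

  def initialize(id, blocked: false)
    @id = id
    @blocked = blocked
    @errors = []
    @destroyed = false
  end

  # Mimics a destroy that can be refused; returns self either way.
  def destroy
    if @blocked
      @errors << "Cannot delete record because dependent rows exist"
    else
      @destroyed = true
    end
    self
  end

  def destroyed?
    @destroyed
  end
end

# Attempts to destroy every row in the batch, then raises for the first one that survived.
def destroy_batch!(batch)
  attempted = batch.map(&:destroy)
  failed = attempted.detect { |row| !row.destroyed? }
  raise "failed removing record: #{failed.class.name} with id: #{failed.id} with error: #{failed.errors}" if failed

  attempted.size
end

puts destroy_batch!([FakeRow.new(1), FakeRow.new(2)]) # 2

begin
  destroy_batch!([FakeRow.new(3), FakeRow.new(4, blocked: true)])
rescue RuntimeError => e
  puts e.message
end
```

In this sketch, rows destroyed before the failing one stay destroyed; the error only stops the purge from carrying on silently.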

@jrafanie jrafanie mentioned this pull request Mar 25, 2025
13 tasks
Fixes most of the issues in ManageIQ#23307

We were leaving around lots of orphaned container* rows when we removed
the container entities.  This change allows us to opt into using destroy
on the primary table in situations where we know the associated records
are NOT going to be many tens of thousands of rows.  If they are, those
associations should NOT be using dependent :destroy, and should have
their own purger.
@jrafanie jrafanie changed the title [WIP] Opt into purging by destroy for container entities, use delete elsewhere Opt into purging by destroy for container entities, use delete elsewhere Mar 25, 2025
@jrafanie jrafanie removed the wip label Mar 25, 2025
Add comment about pvcs from projects, as pvcs are removed via the
container volume's belongs_to association.

Add comment about pvc having container_volumes that can live on their own
and be used by a different claim, no need to delete the volume when a
claim is removed.

Leave it to the purger where we already have purgers

No other model nullifies metrics|states.
@jrafanie jrafanie force-pushed the purge-revamp branch 2 times, most recently from 385b3b9 to 47b2ba1 Compare March 25, 2025 17:50
@@ -1,6 +1,19 @@
RSpec.describe PurgingMixin do
let(:example_class) { PolicyEvent }
let(:purge_date) { 2.weeks.ago }
purge_by_delete_classes, purge_by_destroy_classes = ActiveRecord::Base.descendants.select { |m| m.ancestors.include?(PurgingMixin) && m.base_model == m }.partition { |m| m.purge_method == :delete }
Member Author

I had to add the base_model check as we don't need to test all the descendant classes such as:

  ManageIQ::Providers::Azure::ContainerManager::ContainerGroup.purge_method is destroy
  ManageIQ::Providers::Kubernetes::ContainerManager::ContainerGroup.purge_method is destroy
  ManageIQ::Providers::Vmware::ContainerManager::ContainerGroup.purge_method is destroy
  ManageIQ::Providers::OracleCloud::ContainerManager::Container.purge_method is destroy

Member Author

TODO: add base_model? check and use it here

Member

Yeah as discussed there's a base_class and base_class? method, so base_model should match that pattern with a base_model? method. Then we can use that here.
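
The selection in the spec line can be illustrated with plain classes. Everything here is a hypothetical stand-in: `PurgeableBase` plays the role of the shared ancestor, and `base_model?` is approximated as "direct child of that ancestor" rather than the real alias of `base_class?`.

```ruby
class PurgeableBase
  def self.purge_method
    :delete
  end

  # Hypothetical stand-in for base_model?: true only for direct children of PurgeableBase.
  def self.base_model?
    superclass == PurgeableBase
  end
end

class FakePolicyEvent < PurgeableBase; end

class FakeContainerGroup < PurgeableBase
  def self.purge_method
    :destroy
  end
end

# Provider-specific subclass: inherits :destroy, so testing it separately adds nothing.
class FakeKubernetesContainerGroup < FakeContainerGroup; end

candidates = [FakePolicyEvent, FakeContainerGroup, FakeKubernetesContainerGroup]
by_delete, by_destroy = candidates.select(&:base_model?).partition { |k| k.purge_method == :delete }

puts by_delete.inspect  # [FakePolicyEvent]
puts by_destroy.inspect # [FakeContainerGroup]
```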

@@ -3,6 +3,9 @@ module Purging
extend ActiveSupport::Concern
include PurgingMixin

# According to 022e15256fd07fa7bf5b3ade7ce16b13daa87b84
# This is necessary because ContainerQuotaItem may be archived due to edits
# to parent ContainerQuota that is still alive.
Member Author

@jrafanie jrafanie Mar 25, 2025

Added a bullet item to review if archiving/purging is needed for container quota/quota scopes/quota items in #23394

@@ -1,7 +1,8 @@
module ActiveRecord
class Base
class << self
-  alias_method :base_model, :base_class
+  alias_method :base_model, :base_class
+  alias_method :base_model?, :base_class?
Member Author

This is the new alias to simplify the purging_mixin test 👇

@Fryguy Fryguy merged commit 27854bb into ManageIQ:master Mar 25, 2025
8 of 12 checks passed
3 participants