fix(removal): add conditional lifecycle guards for storage attachment advancement#21750
fix(removal): add conditional lifecycle guards for storage attachment advancement#21750
Conversation
23aa970 to
6049a64
Compare
domain/removal/state/model/unit.go
Outdated
| "DELETE FROM unit_agent_presence WHERE unit_uuid = $entityUUID.uuid", | ||
| "DELETE FROM secret_unit_consumer WHERE unit_uuid = $entityUUID.uuid", | ||
| "DELETE FROM storage_unit_owner WHERE unit_uuid = $entityUUID.uuid", | ||
| "DELETE FROM storage_attachment WHERE unit_uuid = $entityUUID.uuid", |
There was a problem hiding this comment.
Storage attachment is its own entity, it should not be deleted here, that is like deleting the machine because the unit is being deleted.
| "DELETE FROM storage_attachment WHERE unit_uuid = $entityUUID.uuid", |
There was a problem hiding this comment.
This is no longer applicable, I applied the fix you had requested in the other PR (#21761 (comment)) and make the storage attachment removal in this patch now. See cdc6e00.
6049a64 to
cdc6e00
Compare
| MachineScopeProvisioned: dbVal.ProvisionScopeID == 1, | ||
| StorageAttachmentDeadOrGone: dbVal.StorageAttachmentLifeID == int(life.Dead) || dbVal.StorageAttachmentLifeID == -1, | ||
| MachineGone: dbVal.MachineGone == 1, |
There was a problem hiding this comment.
This is business logic in state, I know I'm being pedantic, but these should just be passed back.
| // The following errors may be returned: | ||
| // - [storageerrors.StorageAttachmentNotFound] if the storage attachment | ||
| // no longer exists in the model. | ||
| func (st *State) GetStorageAttachmentHookInfo( |
There was a problem hiding this comment.
| func (st *State) GetStorageAttachmentHookInfo( | |
| func (st *State) GetUniterInternalStorageStateByStorageAttachmentUUID( |
| CASE WHEN mf.machine_uuid IS NULL THEN 1 | ||
| WHEN m.uuid IS NULL THEN 1 | ||
| ELSE 0 |
| FROM storage_filesystem_attachment sfa | ||
| JOIN storage_filesystem sf ON sfa.storage_filesystem_uuid = sf.uuid | ||
| LEFT JOIN storage_instance_filesystem sif ON sf.uuid = sif.storage_filesystem_uuid | ||
| LEFT JOIN storage_attachment sa ON sif.storage_instance_uuid = sa.storage_instance_uuid |
There was a problem hiding this comment.
We don't care about the storage attachment life, because the filesystem attachment should only go to dying/dead once the storage attachment is dead.
| q := ` | ||
| SELECT sf.provision_scope_id AS &provisionedAttachmentAdvanceInfo.provision_scope_id, | ||
| COALESCE(sa.life_id, -1) AS &provisionedAttachmentAdvanceInfo.storage_attachment_life_id, | ||
| CASE WHEN mf.machine_uuid IS NULL THEN 1 |
There was a problem hiding this comment.
machine_filesystem is a filesystem owned by the machine, if we've deleted this row, we no longer are owned by a machine and we should not do anything with this filesystem attachment.
The machine_filesystem should only be removed when the filesystem that is owned by the machine is removed, i.e. before the machine.
| SELECT sf.provision_scope_id AS &provisionedAttachmentAdvanceInfo.provision_scope_id, | ||
| COALESCE(sa.life_id, -1) AS &provisionedAttachmentAdvanceInfo.storage_attachment_life_id, | ||
| CASE WHEN mf.machine_uuid IS NULL THEN 1 | ||
| WHEN m.uuid IS NULL THEN 1 |
There was a problem hiding this comment.
This is tautological business logic.
| SELECT sv.provision_scope_id AS &provisionedAttachmentAdvanceInfo.provision_scope_id, | ||
| COALESCE(sa.life_id, -1) AS &provisionedAttachmentAdvanceInfo.storage_attachment_life_id, | ||
| CASE WHEN mv.machine_uuid IS NULL THEN 1 | ||
| WHEN m.uuid IS NULL THEN 1 | ||
| ELSE 0 | ||
| END AS &provisionedAttachmentAdvanceInfo.machine_gone | ||
| FROM storage_volume_attachment sva | ||
| JOIN storage_volume sv ON sva.storage_volume_uuid = sv.uuid | ||
| LEFT JOIN storage_instance_volume siv ON sv.uuid = siv.storage_volume_uuid | ||
| LEFT JOIN storage_attachment sa ON siv.storage_instance_uuid = sa.storage_instance_uuid | ||
| LEFT JOIN machine_volume mv ON sv.uuid = mv.volume_uuid | ||
| LEFT JOIN machine m ON mv.machine_uuid = m.uuid | ||
| WHERE sva.uuid = $entityUUID.uuid |
There was a problem hiding this comment.
Same comments as for filesystems.
| } | ||
| var state map[string]bool | ||
| if err := yaml.Unmarshal([]byte(storageStateYAML), &state); err != nil { | ||
| return false |
There was a problem hiding this comment.
If you get an error, that is an error, it is not proof that the hook did not fire.
| ) | ||
| } | ||
|
|
||
| if !hookInfo.UnitDead && storageAttachedHookFired(hookInfo.StorageID, hookInfo.StorageState) { |
There was a problem hiding this comment.
- These should be two different errors.
- This is saying if a unit is alive and the storage attached hook fired then exit. This is wrong, it should be if the unit is not dead OR the storageAttachedHookFired.
There was a problem hiding this comment.
To me this is logically equivalent to the rule you states in previous PR.
#21761 (comment)
Storage Attachments MUST not go to Dead until the Uniter says it is Dead or the removal was forceful. The exception here would be if the Uniter never started or the Storage Attached hook never fired.
Here we check that the unit is not dead (so it has started) and that the storageFired
- !(Started && fired) === !notStarted || !fired
So if started && fired means we are in the exception case were storage attachement cannot go to dead.
| return info.MachineScopeProvisioned && | ||
| info.StorageAttachmentDeadOrGone && | ||
| info.MachineGone |
There was a problem hiding this comment.
For the attachment to be dying we must have a dead storage attachment, so we don't need to check the storage attachment dead or gone.
For the reset, the logic is:
- Filesystem Attachment is Machine Provisioned and
- Filesystem for the Filesystem attachment is Machine owned -or-
The Volume backing the Filesystem is Machine Provisioned and is Machine owned. - The Machine is Dead.
There was a problem hiding this comment.
For the Volume attachment to be dying we must have a dead storage attachment, so we don't need to check the storage attachment dead or gone.
For the reset, the logic is:
- The Volume for the Volume attachment is Machine Provisioned and is Machine owned.
- The Machine is Dead.
advancement Storage attachments, filesystem attachments, and volume attachments were being unconditionally advanced from dying to dead in removal job processors. This violates lifecycle rules when provisioners may still be responsible for them. Add guards so that: - Storage attachment dying->dead: only if force, unit is dead, or the storage-attached hook never fired (checked via unit_state YAML) - Filesystem/volume attachment dying->dead: only if force, or the storage attachment is dead/gone AND the filesystem/volume is machine-scoped AND the owning machine is gone - Filesystem/volume attachments are now marked as dead before deletion when dying Also remove storage_attachment and storage_unit_owner cleanup from deleteForeignKeyUnitReferences, since storage attachments are their own entities with independent lifecycles and must be removed through their own removal jobs before unit deletion.
cdc6e00 to
f1a0c35
Compare
gfouillet
left a comment
There was a problem hiding this comment.
QA ok, will take another look when Harry's comments will be addressed.
|
|
||
| // StorageAttachmentHookInfo contains the information required to determine | ||
| // if a dying storage attachment can be safely advanced to dead. | ||
| type StorageAttachmentHookInfo struct { |
There was a problem hiding this comment.
nitpick: Naming.
Both StorageAttachmentHookInfo and ProvisionedAttachmentAdvanceInfo have the same responsibility (holding data to decide wether a thing can be advanced to dead.)
Hook is contextual information that might not remains correct.
| type StorageAttachmentHookInfo struct { | |
| type StorageAttachmentAdvanceInfo struct { |
|
|
||
| // StorageAttachmentDeadOrGone is true when the associated storage | ||
| // attachment is dead or no longer exists. | ||
| StorageAttachmentDeadOrGone bool |
There was a problem hiding this comment.
thought: I am unsure about this naming, which is very precise and leak implementation (which may or may not be a good thing)
I wonder if using a semantically less precise wording for all the boolean gates here could be more informative
Instead of having Dead/Gone naming, maybe taking a step back and use "Hold" or similar word that ship the idea of dependencies between entities.
Just a thought, then, if no idea came up i am fine.
| storageerrors "github.com/juju/juju/domain/storage/errors" | ||
| storageprovisioningerrors "github.com/juju/juju/domain/storageprovisioning/errors" | ||
| "github.com/juju/juju/internal/errors" | ||
| "gopkg.in/yaml.v3" |
| ) | ||
| } | ||
|
|
||
| if !hookInfo.UnitDead && storageAttachedHookFired(hookInfo.StorageID, hookInfo.StorageState) { |
There was a problem hiding this comment.
To me this is logically equivalent to the rule you states in previous PR.
#21761 (comment)
Storage Attachments MUST not go to Dead until the Uniter says it is Dead or the removal was forceful. The exception here would be if the Uniter never started or the Storage Attached hook never fired.
Here we check that the unit is not dead (so it has started) and that the storageFired
- !(Started && fired) === !notStarted || !fired
So if started && fired means we are in the exception case were storage attachement cannot go to dead.
| SELECT u.life_id AS &storageAttachmentUnitInfo.unit_life_id, | ||
| si.storage_id AS &storageAttachmentUnitInfo.storage_id, | ||
| COALESCE(us.storage_state, '') AS &storageAttachmentUnitInfo.storage_state | ||
| FROM storage_attachment sa | ||
| JOIN unit u ON sa.unit_uuid = u.uuid | ||
| JOIN storage_instance si ON sa.storage_instance_uuid = si.uuid | ||
| LEFT JOIN unit_state us ON sa.unit_uuid = us.unit_uuid | ||
| WHERE sa.uuid = $entityUUID.uuid | ||
| ` |
There was a problem hiding this comment.
todo:
| SELECT u.life_id AS &storageAttachmentUnitInfo.unit_life_id, | |
| si.storage_id AS &storageAttachmentUnitInfo.storage_id, | |
| COALESCE(us.storage_state, '') AS &storageAttachmentUnitInfo.storage_state | |
| FROM storage_attachment sa | |
| JOIN unit u ON sa.unit_uuid = u.uuid | |
| JOIN storage_instance si ON sa.storage_instance_uuid = si.uuid | |
| LEFT JOIN unit_state us ON sa.unit_uuid = us.unit_uuid | |
| WHERE sa.uuid = $entityUUID.uuid | |
| ` | |
| SELECT u.life_id AS &storageAttachmentUnitInfo.unit_life_id, | |
| si.storage_id AS &storageAttachmentUnitInfo.storage_id, | |
| COALESCE(us.storage_state, '') AS &storageAttachmentUnitInfo.storage_state | |
| FROM storage_attachment AS sa | |
| JOIN unit AS u ON sa.unit_uuid = u.uuid | |
| JOIN storage_instance AS si ON sa.storage_instance_uuid = si.uuid | |
| LEFT JOIN unit_state AS us ON sa.unit_uuid = us.unit_uuid | |
| WHERE sa.uuid = $entityUUID.uuid |
This is a good thing to follow the sqlfluff rules we enforce in our DDL, even in embedded query
This patch ensures we remove the storage attachments before attempting to remove the machine. This had to be done from the removal domain. Initially this was done in a separate patch (#21761).
Storage attachments, filesystem attachments, and volume attachments were
being unconditionally advanced from dying to dead in removal job
processors. This violates lifecycle rules when provisioners may still be
responsible for them.
Add guards so that:
storage-attached hook never fired (checked via unit_state YAML)
storage attachment is dead/gone AND the filesystem/volume is
machine-scoped AND the owning machine is gone
when dying
The rules requested in #21761 (comment) were taken into account to fulfill the storage attachment removal: new types were added and queries to ensure that the storage attachment hook was not fired, we query the state of the storage attachment before deciding to advance its life.
QA steps
First pack the
dummy-storagecharm (from /testcharms). Then deploy it and remove the app before the machine reaches the RUNNING state:You shouldn't see any errors in the logs and the app/unit/machine should be removed (after a few moments).
Links
Issue: Fixes #21717.
Jira card: JUJU-9147