Description
Kibana version:
9.2.0.
Elasticsearch version:
9.2.0
Original install method (e.g. download page, yum, from source, etc.):
ESS/ECH
Describe the bug:
The agent migration feature gets stuck if you specify an invalid or revoked token.
The migration action never completes, and subsequent migration attempts do nothing.
None of the following fixes it:
- restarting the agent itself
- restarting Kibana
- restarting Fleet
- cancelling the migration agent action
 
However, I can get it working again if I switch the agent to a different policy, restart the agent, and try the migration again.
I reproduced this on two separate clusters where I was testing the feature and migrating the agent back and forth.
This does not seem to occur on all types of failures. I forced a connectivity problem (disabled routing to the new fleet server) and the migration task failed and completed, as one would expect, and allowed me to successfully retry.
Sidenote:
Sometimes the agent would stop checking in to the original Fleet and go offline (though ingest continued to work), in which case an agent restart on the host was needed to bring it back online.
Steps to reproduce:
- Install and enroll Elastic Agent into Fleet
- Attempt to migrate the agent but specify a known invalid string for the token (any arbitrary text will do)
  - This failed migration will remain in an IN_PROGRESS state indefinitely
- Attempt to migrate the agent again using a valid token
  - This action will never run and will also remain IN_PROGRESS indefinitely
- Attach the agent to any other agent policy
- Retry the migration (this succeeds)
 
Expected behavior:
Failed agent tasks should not run indefinitely
Subsequent migration attempts should be able to run without any extra shenanigans
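To make the expectation concrete, here is a minimal sketch of a retry loop that backs off on transient errors but treats an authentication failure as terminal, so the action always ends in a final state. This is not Elastic Agent code: the function name, the status classification, and the `max_attempts` bound are all assumptions for illustration. Only the backoff values (init 5s, max 10m) come from the log below.

```python
import time
from typing import Callable

# Assumption: auth failures will not fix themselves on retry, so stop immediately.
TERMINAL_STATUSES = {401, 403}


def enroll_with_backoff(
    attempt: Callable[[], int],
    init: float = 5.0,         # initial delay in seconds, as reported in the log
    max_delay: float = 600.0,  # 10m cap, as reported in the log
    max_attempts: int = 8,     # hypothetical bound so the action cannot run forever
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call attempt() until it succeeds, hits a terminal status, or runs out of tries.

    attempt() returns an HTTP status code. The result is "COMPLETE" or "FAILED",
    never an open-ended in-progress state.
    """
    delay = init
    for _ in range(max_attempts):
        status = attempt()
        if 200 <= status < 300:
            return "COMPLETE"
        if status in TERMINAL_STATUSES:
            return "FAILED"
        # Transient failure: wait, then double the delay up to the cap.
        sleep(delay)
        delay = min(delay * 2, max_delay)
    return "FAILED"
```

With this shape, a bad token fails on the first 401 and a later attempt with a valid token is free to run, while a connectivity problem still gets the usual backoff.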
Screenshots (if relevant):
Sorry for the size of the screenshot.  I had to zoom out a lot to capture the info from the screenshot

The agent action cancellation request succeeded, but the action remains IN_PROGRESS, as do the subsequent corrected attempts.
Provide logs and/or server output (if relevant):
Agent logs show that the initial bad-token migration keeps retrying enrollment against the new Fleet Server hours later, even after a retry with a corrected token.
Oct 28, 2025 @ 10:05:53.395	elastic_agent	(null)	Error detected: fail to execute request to fleet-server: status code: 401, fleet-server returned an error: ErrUnauthorized, message: unauthorized, will retry in a moment.	
Oct 28, 2025 @ 10:05:52.919	elastic_agent	(null)	Retrying enrollment to URL: https://[redacted].fleet.us-central1.gcp.cloud.es.io//	
Oct 28, 2025 @ 09:57:47.717	elastic_agent	(null)	Error detected: fail to execute request to fleet-server: status code: 401, fleet-server returned an error: ErrUnauthorized, message: unauthorized, will retry in a moment.	
Oct 28, 2025 @ 09:57:47.246	elastic_agent	(null)	Retrying enrollment to URL: https://[redacted].fleet.us-central1.gcp.cloud.es.io//	
Oct 28, 2025 @ 09:48:37.934	elastic_agent	(null)	Error detected: fail to execute request to fleet-server: status code: 401, fleet-server returned an error: ErrUnauthorized, message: unauthorized, will retry in a moment.	
Oct 28, 2025 @ 09:48:37.475	elastic_agent	(null)	Retrying enrollment to URL: https://[redacted].fleet.us-central1.gcp.cloud.es.io//	
Oct 28, 2025 @ 09:40:14.098	elastic_agent	(null)	Error detected: fail to execute request to fleet-server: status code: 401, fleet-server returned an error: ErrUnauthorized, message: unauthorized, will retry in a moment.	
Oct 28, 2025 @ 09:40:13.359	elastic_agent	(null)	Retrying enrollment to URL: https://[redacted].fleet.us-central1.gcp.cloud.es.io//	
Oct 28, 2025 @ 09:32:55.673	elastic_agent	(null)	Error detected: fail to execute request to fleet-server: status code: 401, fleet-server returned an error: ErrUnauthorized, message: unauthorized, will retry in a moment.	
Oct 28, 2025 @ 09:32:55.179	elastic_agent	(null)	Retrying enrollment to URL: https://[redacted].fleet.us-central1.gcp.cloud.es.io//	
Oct 28, 2025 @ 09:31:57.318	elastic_agent	(null)	Checkin request to fleet-server succeeded after 1 failures	
Oct 28, 2025 @ 09:20:04.255	elastic_agent	(null)	Possible transient error during checkin with fleet-server, retrying	
Oct 28, 2025 @ 08:44:43.471	elastic_agent	(null)	Error detected: fail to execute request to fleet-server: status code: 401, fleet-server returned an error: ErrUnauthorized, message: unauthorized, will retry in a moment.	
Oct 28, 2025 @ 08:44:42.542	elastic_agent	(null)	Retrying enrollment to URL: https://[redacted].fleet.us-central1.gcp.cloud.es.io//	
Oct 28, 2025 @ 08:37:22.637	elastic_agent	(null)	Error detected: fail to execute request to fleet-server: status code: 401, fleet-server returned an error: ErrUnauthorized, message: unauthorized, will retry in a moment.	
Oct 28, 2025 @ 08:37:22.164	elastic_agent	(null)	Retrying enrollment to URL: https://[redacted].fleet.us-central1.gcp.cloud.es.io//	
Oct 28, 2025 @ 08:31:26.486	elastic_agent	(null)	Error detected: fail to execute request to fleet-server: status code: 401, fleet-server returned an error: ErrUnauthorized, message: unauthorized, will retry in a moment.	
Oct 28, 2025 @ 08:31:26.008	elastic_agent	(null)	Retrying enrollment to URL: https://[redacted].fleet.us-central1.gcp.cloud.es.io//	
Oct 28, 2025 @ 08:27:33.981	elastic_agent	(null)	Error detected: fail to execute request to fleet-server: status code: 401, fleet-server returned an error: ErrUnauthorized, message: unauthorized, will retry in a moment.	
Oct 28, 2025 @ 08:27:33.492	elastic_agent	(null)	Retrying enrollment to URL: https://[redacted].fleet.us-central1.gcp.cloud.es.io//	
Oct 28, 2025 @ 08:25:34.593	elastic_agent	(null)	Error detected: fail to execute request to fleet-server: status code: 401, fleet-server returned an error: ErrUnauthorized, message: unauthorized, will retry in a moment.	
Oct 28, 2025 @ 08:25:34.593	elastic_agent	(null)	1st enrollment attempt failed, retrying enrolling to URL: https://[redacted].fleet.us-central1.gcp.cloud.es.io// with exponential backoff (init 5s, max 10m0s)	
Any additional context:
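The retry cadence in the log above can be checked with a short script. This is a rough sketch that assumes the log lines are tab-separated with the timestamp in the first column, as they appear when copied from Discover. The function name `retry_gaps` is made up for this example.

```python
from datetime import datetime

# Timestamp layout as it appears in the pasted log lines.
LOG_FORMAT = "%b %d, %Y @ %H:%M:%S.%f"


def retry_gaps(lines):
    """Return the seconds between consecutive enrollment retries, oldest first."""
    stamps = sorted(
        datetime.strptime(line.split("\t")[0], LOG_FORMAT)
        for line in lines
        if "Retrying enrollment" in line
    )
    return [(later - earlier).total_seconds() for earlier, later in zip(stamps, stamps[1:])]


# Three of the retry lines from the log above.
sample = [
    "Oct 28, 2025 @ 10:05:52.919\telastic_agent\t(null)\tRetrying enrollment to URL: https://[redacted].fleet.us-central1.gcp.cloud.es.io//",
    "Oct 28, 2025 @ 09:57:47.246\telastic_agent\t(null)\tRetrying enrollment to URL: https://[redacted].fleet.us-central1.gcp.cloud.es.io//",
    "Oct 28, 2025 @ 09:48:37.475\telastic_agent\t(null)\tRetrying enrollment to URL: https://[redacted].fleet.us-central1.gcp.cloud.es.io//",
]

for gap in retry_gaps(sample):
    print(f"{gap / 60:.1f} minutes between retries")
```

The gaps between these late retries stay just under the 10m cap. That fits an enrollment loop that has reached its maximum backoff and keeps going, rather than one that gives up on the 401.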