Description
Describe the bug
On an overloaded VM, we hit the following issue:
- the BOSH VM is reported as unresponsive by the director:
Task 5813416. Done
Deployment '00-shared-services-r2'
Instance                                                          Process State       AZ     IPs            Deployment
services-agents-r2-z1/48bd83b9-22d8-469a-9425-2ca16412a79a        running             r2-z1  xx.xx.xx.6     00-shared-services-r2
                                                                                             192.168.64.67
services-agents-r2-z1/5082819e-3ddc-493e-a38c-3894a81f668e        unresponsive agent  r2-z1  192.168.64.68  00-shared-services-r2
                                                                                             xx.xx.xx.7
services-agents-r2-z1/f0764857-34a1-403c-8302-1b89671133b0        unresponsive agent  r2-z1  xx.xx.xx.5     00-shared-services-r2
                                                                                             192.168.64.66
services-agents-r2-z2/59810cf3-f7ab-46d7-bbff-1cefc536cfe3        running             r2-z2  192.168.64.74  00-shared-services-r2
                                                                                             xx.xx.xx.9
services-proxy-agents-r2-z1/fe212593-e542-4f0a-bcff-5e38485a6c73  unresponsive agent  r2-z1  192.168.64.73  00-shared-services-r2
                                                                                             xx.xx.xx.8
5 instances
Succeeded
- technically, the VM is up (ping and nc -vz succeed, and the monit processes are up when checked)
- however, bosh cck fails with the following error:
$ bosh cck
Using environment '192.168.99.152' as user 'yyyyy'
Using deployment '00-shared-services-r2'
Task 5813417
Task 5813417 | 18:27:48 | Scanning 5 VMs: Checking VM states (00:00:17)
L Error: Action Failed get_state: Getting processes status: Getting service status: Unmarshalling Monit status: unexpected EOF
Task 5813417 | 18:28:05 | Error: Action Failed get_state: Getting processes status: Getting service status: Unmarshalling Monit status: unexpected EOF
Task 5813417 Started Wed Jun 12 18:27:48 UTC 2024
Task 5813417 Finished Wed Jun 12 18:28:05 UTC 2024
Task 5813417 Duration 00:00:17
Task 5813417 error
Performing a scan on deployment '00-shared-services-r2':
Expected task '5813417' to succeed but state is 'error'
- the scan fails before the operator can choose any resolution, so bosh cck is not usable
- workaround:
- bosh deploy <manifest.yml> --fix
- bosh is then able to repair the deployment by recreating the unresponsive VMs
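For anyone who wants to script the triage above, here is a minimal sketch, not a tested tool. It assumes the plain-text `bosh vms` output has been captured into a string and that the instance column always has the form `<group>/<uuid>`. The helper names `unresponsive_instances` and `fix_command` are hypothetical, and the manifest path is whatever you normally deploy with.

```python
import re
from typing import List

# Assumption: the first column of each row is "<group>/<uuid>"; the rest of
# the row starts with the process state. IP continuation lines do not match.
ROW = re.compile(
    r"^(?P<instance>\S+/[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})\s+(?P<rest>.+)$"
)


def unresponsive_instances(vms_output: str) -> List[str]:
    """Return the instances whose process state is 'unresponsive agent'."""
    found = []
    for line in vms_output.splitlines():
        match = ROW.match(line.strip())
        if match and match.group("rest").startswith("unresponsive agent"):
            found.append(match.group("instance"))
    return found


def fix_command(deployment: str, manifest: str) -> List[str]:
    """Build the workaround command from this report (it is not executed here)."""
    return ["bosh", "-d", deployment, "deploy", manifest, "--fix"]


if __name__ == "__main__":
    sample = (
        "services-agents-r2-z1/48bd83b9-22d8-469a-9425-2ca16412a79a running r2-z1 xx.xx.xx.6 00-shared-services-r2\n"
        "192.168.64.67\n"
        "services-agents-r2-z1/5082819e-3ddc-493e-a38c-3894a81f668e unresponsive agent r2-z1 192.168.64.68 00-shared-services-r2\n"
        "xx.xx.xx.7\n"
    )
    for instance in unresponsive_instances(sample):
        print("unresponsive:", instance)
    print(" ".join(fix_command("00-shared-services-r2", "manifest.yml")))
```

The sketch only builds the command rather than running it, since `--fix` recreates VMs and should stay an explicit operator decision.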
To Reproduce
Steps to reproduce the behavior (example):
- Deploy a bosh director on with
- Upload and
- Deploy
- bosh ssh to a specific instance
- Run on the vm to see the behavior
Expected behavior
A clear and concise description of what you expected to happen.
Logs
Logs are always helpful! Add logs to help explain your problem.
Versions (please complete the following information):
- Infrastructure: vsphere
- BOSH version bosh/277.4.3
- BOSH CLI version 7.5.6
- Stemcell version bosh-vsphere-esxi-ubuntu-jammy-go_agent 1.465* ubuntu-jammy
Deployment info:
If possible, share your (redacted) manifest and any ops files used to deploy
BOSH or any other releases on top of BOSH.
If you used any deployment strategy it'd be helpful to point it out and share as
much about it as possible (e.g. bosh-deployment, PCF, genesis, spiff, etc)
Additional context
Add any other context about the problem here.