unable to bosh cck an unresponsive vm (very high cpu load) #2531

Open
@poblin-orange

Description

Describe the bug

On an overloaded VM, we hit the following issue:

  • The VM is reported as unresponsive by the director:
Task 5813416. Done
Deployment '00-shared-services-r2'
Instance                                                          Process State       AZ     IPs            Deployment  
services-agents-r2-z1/48bd83b9-22d8-469a-9425-2ca16412a79a        running             r2-z1  xx.xx.xx.6   00-shared-services-r2  
                                                                                             192.168.64.67    
services-agents-r2-z1/5082819e-3ddc-493e-a38c-3894a81f668e        unresponsive agent  r2-z1  192.168.64.68  00-shared-services-r2  
                                                                                             xx.xx.xx.7     
services-agents-r2-z1/f0764857-34a1-403c-8302-1b89671133b0        unresponsive agent  r2-z1  xx.xx.xx.5   00-shared-services-r2  
                                                                                             192.168.64.66    
services-agents-r2-z2/59810cf3-f7ab-46d7-bbff-1cefc536cfe3        running             r2-z2  192.168.64.74  00-shared-services-r2  
                                                                                             xx.xx.xx.9     
services-proxy-agents-r2-z1/fe212593-e542-4f0a-bcff-5e38485a6c73  unresponsive agent  r2-z1  192.168.64.73  00-shared-services-r2  
                                                                                             xx.xx.xx.8     
5 instances
Succeeded
  • Technically, the VM is up (it answers ping and nc -vz, and the monit process is running when checked).

  • However, bosh cck fails with the following error:

$ bosh cck
Using environment '192.168.99.152' as user 'yyyyy'
Using deployment '00-shared-services-r2'
Task 5813417
Task 5813417 | 18:27:48 | Scanning 5 VMs: Checking VM states (00:00:17)
                        L Error: Action Failed get_state: Getting processes status: Getting service status: Unmarshalling Monit status: unexpected EOF
Task 5813417 | 18:28:05 | Error: Action Failed get_state: Getting processes status: Getting service status: Unmarshalling Monit status: unexpected EOF
Task 5813417 Started  Wed Jun 12 18:27:48 UTC 2024
Task 5813417 Finished Wed Jun 12 18:28:05 UTC 2024
Task 5813417 Duration 00:00:17
Task 5813417 error
Performing a scan on deployment '00-shared-services-r2':
  Expected task '5813417' to succeed but state is 'error'
  • The scan fails before the operator can choose any resolution, so bosh cck is not usable.
  • Workaround:
    • bosh deploy <manifest.yml> --fix
    • With --fix, BOSH is able to repair the deployment by recreating the unresponsive VMs.
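
The workaround above can be sketched as a small shell helper. This is only a sketch: the deployment name is taken from the output in this report, manifest.yml is a placeholder path, and fix_command is a hypothetical helper that prints the command for review instead of running it.

```shell
#!/bin/sh
# Sketch only: prints the workaround command rather than executing it.
set -eu

# Hypothetical helper (not part of the bosh CLI).
# $1: deployment name, $2: path to the deployment manifest (placeholder).
fix_command() {
  printf 'bosh -d %s deploy %s --fix\n' "$1" "$2"
}

fix_command 00-shared-services-r2 manifest.yml
```

Once the printed command looks right, it can be run directly. If the manifest is not at hand, `bosh -d 00-shared-services-r2 manifest > manifest.yml` should fetch the last deployed one from the director.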

Versions (please complete the following information):

  • Infrastructure: vsphere
  • BOSH version bosh/277.4.3
  • BOSH CLI version 7.5.6
  • Stemcell version bosh-vsphere-esxi-ubuntu-jammy-go_agent 1.465* ubuntu-jammy
