Issue
The cc.api_post_start_healthcheck_timeout_in_seconds field doesn't seem to have any effect.
Context
The cc.api_post_start_healthcheck_timeout_in_seconds config field seems to imply that post-start will fail if CCNG fails to report healthy within that time frame:
capi-release/jobs/cloud_controller_ng/spec, lines 349 to 351 in 6f0f64c:

    cc.api_post_start_healthcheck_timeout_in_seconds:
      default: 60
      description: "Maximum time (in seconds) for cloud_controller_ng to report healthy"

However, we pushed a change to CCNG that sleeps for 10 hours before starting the Thin server (in the runner), and this timeout was not triggered. Instead, 20 minutes passed before both ccng_monit_http_healthcheck and nginx_cc failed.
Task 277 | 22:03:30 | L starting jobs: api/abf448d3-74e1-4086-80c2-130814373e14 (0) (canary)
Task 277 | 22:03:57 | Updating instance scheduler: scheduler/7d611e5a-1ada-41a4-b811-49b5dbcb2b2f (0) (canary) (00:02:09)
Task 277 | 22:23:32 | Updating instance api: api/abf448d3-74e1-4086-80c2-130814373e14 (0) (canary) (00:21:44)
L Error: 'api/abf448d3-74e1-4086-80c2-130814373e14 (0)' is not running after update. Review logs for failed jobs: ccng_monit_http_healthcheck, nginx_cc
Task 277 | 22:23:32 | Error: 'api/abf448d3-74e1-4086-80c2-130814373e14 (0)' is not running after update. Review logs for failed jobs: ccng_monit_http_healthcheck, nginx_cc
So it seems that this check in the post-start script:

    wait_for_server_to_become_healthy "https://<%= discover_external_ip %>:${PORT}/healthz" "<%= p("cc.api_post_start_healthcheck_timeout_in_seconds") %>"

no longer has any effect.
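For readers unfamiliar with this kind of check, a timeout-bounded health poll generally looks something like the sketch below. This is only an illustration, not the actual capi-release implementation of wait_for_server_to_become_healthy: the function name `wait_until_healthy`, the `POLL_INTERVAL_SECONDS` variable, and the idea of passing the probe as a command are all assumptions made for the example.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: poll a probe command until it succeeds or a deadline passes.
# NOT the real capi-release helper; all names here are assumptions.

# Usage: wait_until_healthy TIMEOUT_SECONDS PROBE_COMMAND [ARGS...]
# Returns 0 as soon as the probe succeeds, 1 once TIMEOUT_SECONDS have elapsed.
wait_until_healthy() {
  local timeout="$1"
  shift
  local interval="${POLL_INTERVAL_SECONDS:-1}"  # assumed polling interval
  local deadline=$(( SECONDS + timeout ))       # SECONDS is a bash builtin counter

  while (( SECONDS < deadline )); do
    # Run the probe; any zero exit status counts as healthy.
    if "$@" > /dev/null 2>&1; then
      return 0
    fi
    sleep "${interval}"
  done

  echo "healthcheck did not succeed within ${timeout} seconds" >&2
  return 1
}

# Example probe (HOST and PORT are placeholders):
# wait_until_healthy 60 curl --fail --silent --insecure "https://${HOST}:${PORT}/healthz"
```

With a loop of this shape, a server that never becomes healthy should make the script exit non-zero after roughly the configured timeout, which is the behavior we expected from the 60-second default.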
However, when we run this experiment on older versions of CAPI, we observe that the deploy fails in the post-start script after the configured time.
We believe this regression was introduced in this PR: https://github.com/cloudfoundry/capi-release/pull/195/files, which reconfigured the monit dependencies between the processes.
What we are not sure about is whether this regression reintroduces the issue that prompted the introduction of that post-start check in the first place: #125.
Steps to Reproduce
Add a long sleep to this line: https://github.com/cloudfoundry/cloud_controller_ng/blob/main/lib%2Fcloud_controller%2Frunner.rb#L88 and deploy
Expected result
The post-start script fails after the configured cc.api_post_start_healthcheck_timeout_in_seconds value.
Current result
The deploy fails after 20 minutes, reporting ccng_monit_http_healthcheck and nginx_cc as failed jobs.
Possible Fix
Not sure. Is this something we need to address?