Entire pool drained because frontend requested removal of running pilots #282

Open
@mmascher

Describe the bug
During the upgrade to HTCondor 10, one of the CMS frontends (the one exclusively serving the CERN pool) asked the factory to remove pilots, and essentially the whole pool was drained.

Here is the glideclient classad:

Name = "A2PYTHBM_CMSHTPC_T2_CH_CERN_ce503@gfactory_instance@CERN-Prod@CMS_T0-Frontend.local_users"
ReqEncIdentity = "1328afb1bb4af2ca7cee421201a7048d39979808963e4a5cb6397fa37f726187"
ReqEncKeyCode = "b50f881fad085e7d60897dd6475fed04da2ad9a87b515b4bdfbada74ab206dcad2892e286a31fb4490d747697bc5cff2f80499804c192424a065e522f4e242f4535b0484e0eb88b4995573b6910ce72da77db131b3731f07d2394d06e5dd92b0fa0927bfec452e6f321fb5f0146348ab7bf7b159b9e781123e0bea3314f0259b59c939398763b725363e8347683e37307e5997439cfb6d41753748798aecb599fa45905385e77e576c9edee6ec2df4aec2454f697d597e5da7196e938e78661fedd2cb0a0ddfa1afc80c361a6e99dae04c37d0530d7403dbb6cf6f53f1b41171f6282e4299c4ef98e935fcd10a7bf98b95f582be4effe0acd21093eb367658b4"
ReqGlidein = "CMSHTPC_T2_CH_CERN_ce503@gfactory_instance@CERN-Prod"
ReqIdleGlideins = 1
ReqIdleLifetime = "82800"
ReqMaxGlideins = 6
ReqName = "CMSHTPC_T2_CH_CERN_ce503@gfactory_instance@CERN-Prod"
ReqPubKeyID = "56f865cdbf4628ec88d04b792c7daed7"
ReqRemoveExcess = "ALL"
ReqRemoveExcessMargin = 0
UpdateSequenceNumber = 0
UpdatesHistory = "00000000000000000000000000000000"
UpdatesLost = 0
UpdatesSequenced = 0
UpdatesTotal = 22682
WebDescriptFile = "description.n3u71g.cfg"
WebDescriptSign = "d041d24fff115c12e014a9e333aeb67214101183"
WebGroupDescriptFile = "description.n3u71g.cfg"
WebGroupDescriptSign = "dfd6ef2137b3a544abd1fffe38230a3ac3d1b72b"
WebGroupURL = "http://vocms0819.cern.ch/vofrontend/stage/group_local_users"
WebMonitoringURL = "http://vocms0819.cern.ch/vofrontend/monitor"
WebSignType = "sha1"
WebURL = "http://vocms0819.cern.ch/vofrontend/stage"

And here are the factory logs:

[mmascher@vocms0207 ~]$ grep "Client CMS_T0-Frontend.CERN_condor\ " /var/log/gwms-factory/server/entry_CMSHTPC_T2_CH_CERN_ce513/CMSHTPC_T2_CH_CERN_ce513.info.log* -h | grep remove\ excess
...
[2023-03-29 18:33:24,597] INFO: Client CMS_T0-Frontend.CERN_condor (secid: CMS_T0-Frontend_cmspilot) requesting 0 glideins, max running 0, idle lifetime 82800, remove excess 'IDLE', remove_excess_margin 5
[2023-03-29 18:34:23,535] INFO: Client CMS_T0-Frontend.CERN_condor (secid: CMS_T0-Frontend_cmspilot) requesting 0 glideins, max running 0, idle lifetime 82800, remove excess 'IDLE', remove_excess_margin 5
[2023-03-29 18:35:23,421] INFO: Client CMS_T0-Frontend.CERN_condor (secid: CMS_T0-Frontend_cmspilot) requesting 0 glideins, max running 0, idle lifetime 82800, remove excess 'IDLE', remove_excess_margin 5
[2023-03-29 18:36:25,936] INFO: Client CMS_T0-Frontend.CERN_condor (secid: CMS_T0-Frontend_cmspilot) requesting 0 glideins, max running 0, idle lifetime 82800, remove excess 'ALL', remove_excess_margin 0
[2023-03-29 18:37:43,116] INFO: Client CMS_T0-Frontend.CERN_condor (secid: CMS_T0-Frontend_cmspilot) requesting 0 glideins, max running 0, idle lifetime 82800, remove excess 'ALL', remove_excess_margin 0
[2023-03-29 18:39:48,317] INFO: Client CMS_T0-Frontend.CERN_condor (secid: CMS_T0-Frontend_cmspilot) requesting 0 glideins, max running 0, idle lifetime 82800, remove excess 'ALL', remove_excess_margin 0
[2023-03-29 18:40:56,850] INFO: Client CMS_T0-Frontend.CERN_condor (secid: CMS_T0-Frontend_cmspilot) requesting 0 glideins, max running 0, idle lifetime 82800, remove excess 'ALL', remove_excess_margin 0

The frontend is configured with <glideins_removal margin="5" requests_tracking="True" type="IDLE" wait="0"/>.

Because of the upgrade, GSI was disabled and the frontend relied on the idtoken to talk to the Collector and the Schedd. However, a Puppet misconfiguration put the wrong token in place, so the frontend could not talk to either of them.

Is it possible that this code was executed?

# no requests and no glidein registered
# no harm getting rid of everything
remove_excess_running = True

Maybe the Python bindings did not raise an exception in this case and just returned empty dictionaries?
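To make the hypothesis concrete, here is a minimal sketch of why an empty dictionary is ambiguous. It is not the glideinWMS code or the HTCondor bindings API: `QueryFailed`, `failing_query`, `query_swallowing_errors` and `query_strict` are hypothetical names, and the `PermissionError` only stands in for whatever a rejected token actually produces.

```python
from typing import Callable, Dict


class QueryFailed(Exception):
    """Hypothetical error: the collector or schedd could not be queried."""


def failing_query() -> Dict[str, dict]:
    # Stand-in for a query rejected because of a bad token (assumption).
    raise PermissionError("authentication failed")


def query_swallowing_errors(query: Callable[[], Dict[str, dict]]) -> Dict[str, dict]:
    # The failure is hidden: the caller gets the same value as for an empty pool.
    try:
        return query()
    except Exception:
        return {}


def query_strict(query: Callable[[], Dict[str, dict]]) -> Dict[str, dict]:
    # The failure is surfaced, so the caller can tell "empty" from "unknown".
    try:
        return query()
    except Exception as err:
        raise QueryFailed(f"query failed: {err}") from err
```

With the first wrapper, `query_swallowing_errors(failing_query)` and a query against a genuinely empty pool both yield `{}`, so any "nothing is registered" branch downstream cannot distinguish the two cases.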

To Reproduce
I'd start by checking whether the frontend asks for "ALL" removal when the idtoken is swapped for an invalid one.

Expected behavior
If the frontend cannot query the collector and the schedds, it should not send any requests to the factory at all. Maybe the code should raise an exception and even exit?
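The expected behavior could look roughly like the sketch below. It assumes the escalation to "ALL" is triggered when the frontend sees no requests and no registered glideins, as the quoted comment suggests; `choose_remove_excess` and the use of `None` to mean "query failed" are hypothetical and not taken from the frontend code.

```python
from typing import Optional


def choose_remove_excess(configured_type: str,
                         idle_jobs: Optional[int],
                         registered_glideins: Optional[int]) -> Optional[str]:
    """Return the remove-excess value to send, or None to skip this cycle.

    A count of None means "the query failed, value unknown" (assumption).
    """
    if idle_jobs is None or registered_glideins is None:
        # Pool state unknown: send nothing rather than risk draining the pool.
        return None
    if idle_jobs == 0 and registered_glideins == 0:
        # Confirmed empty pool: removing everything does no harm.
        return "ALL"
    # Otherwise keep the type configured in glideins_removal.
    return configured_type
```

For example, `choose_remove_excess("IDLE", None, None)` returns `None`, so a frontend with a bad token would skip the request instead of asking for "ALL".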

Info (please complete the following information):

  • Priority: high
  • Stakeholders: CMS
  • Components: frontend
