Description
Describe the bug
During the upgrade to HTCondor 10, one of the CMS frontends (the one exclusively serving the CERN pool) asked the factory to remove pilots, and essentially the whole pool was drained.
Here is the glideclient classad:
Name = "A2PYTHBM_CMSHTPC_T2_CH_CERN_ce503@gfactory_instance@CERN-Prod@CMS_T0-Frontend.local_users"
ReqEncIdentity = "1328afb1bb4af2ca7cee421201a7048d39979808963e4a5cb6397fa37f726187"
ReqEncKeyCode = "b50f881fad085e7d60897dd6475fed04da2ad9a87b515b4bdfbada74ab206dcad2892e286a31fb4490d747697bc5cff2f80499804c192424a065e522f4e242f4535b0484e0eb88b4995573b6910ce72da77db131b3731f07d2394d06e5dd92b0fa0927bfec452e6f321fb5f0146348ab7bf7b159b9e781123e0bea3314f0259b59c939398763b725363e8347683e37307e5997439cfb6d41753748798aecb599fa45905385e77e576c9edee6ec2df4aec2454f697d597e5da7196e938e78661fedd2cb0a0ddfa1afc80c361a6e99dae04c37d0530d7403dbb6cf6f53f1b41171f6282e4299c4ef98e935fcd10a7bf98b95f582be4effe0acd21093eb367658b4"
ReqGlidein = "CMSHTPC_T2_CH_CERN_ce503@gfactory_instance@CERN-Prod"
ReqIdleGlideins = 1
ReqIdleLifetime = "82800"
ReqMaxGlideins = 6
ReqName = "CMSHTPC_T2_CH_CERN_ce503@gfactory_instance@CERN-Prod"
ReqPubKeyID = "56f865cdbf4628ec88d04b792c7daed7"
ReqRemoveExcess = "ALL"
ReqRemoveExcessMargin = 0
UpdateSequenceNumber = 0
UpdatesHistory = "00000000000000000000000000000000"
UpdatesLost = 0
UpdatesSequenced = 0
UpdatesTotal = 22682
WebDescriptFile = "description.n3u71g.cfg"
WebDescriptSign = "d041d24fff115c12e014a9e333aeb67214101183"
WebGroupDescriptFile = "description.n3u71g.cfg"
WebGroupDescriptSign = "dfd6ef2137b3a544abd1fffe38230a3ac3d1b72b"
WebGroupURL = "http://vocms0819.cern.ch/vofrontend/stage/group_local_users"
WebMonitoringURL = "http://vocms0819.cern.ch/vofrontend/monitor"
WebSignType = "sha1"
WebURL = "http://vocms0819.cern.ch/vofrontend/stage"
And here are the factory logs:
[mmascher@vocms0207 ~]$ grep "Client CMS_T0-Frontend.CERN_condor\ " /var/log/gwms-factory/server/entry_CMSHTPC_T2_CH_CERN_ce513/CMSHTPC_T2_CH_CERN_ce513.info.log* -h | grep remove\ excess
...
[2023-03-29 18:33:24,597] INFO: Client CMS_T0-Frontend.CERN_condor (secid: CMS_T0-Frontend_cmspilot) requesting 0 glideins, max running 0, idle lifetime 82800, remove excess 'IDLE', remove_excess_margin 5
[2023-03-29 18:34:23,535] INFO: Client CMS_T0-Frontend.CERN_condor (secid: CMS_T0-Frontend_cmspilot) requesting 0 glideins, max running 0, idle lifetime 82800, remove excess 'IDLE', remove_excess_margin 5
[2023-03-29 18:35:23,421] INFO: Client CMS_T0-Frontend.CERN_condor (secid: CMS_T0-Frontend_cmspilot) requesting 0 glideins, max running 0, idle lifetime 82800, remove excess 'IDLE', remove_excess_margin 5
[2023-03-29 18:36:25,936] INFO: Client CMS_T0-Frontend.CERN_condor (secid: CMS_T0-Frontend_cmspilot) requesting 0 glideins, max running 0, idle lifetime 82800, remove excess 'ALL', remove_excess_margin 0
[2023-03-29 18:37:43,116] INFO: Client CMS_T0-Frontend.CERN_condor (secid: CMS_T0-Frontend_cmspilot) requesting 0 glideins, max running 0, idle lifetime 82800, remove excess 'ALL', remove_excess_margin 0
[2023-03-29 18:39:48,317] INFO: Client CMS_T0-Frontend.CERN_condor (secid: CMS_T0-Frontend_cmspilot) requesting 0 glideins, max running 0, idle lifetime 82800, remove excess 'ALL', remove_excess_margin 0
[2023-03-29 18:40:56,850] INFO: Client CMS_T0-Frontend.CERN_condor (secid: CMS_T0-Frontend_cmspilot) requesting 0 glideins, max running 0, idle lifetime 82800, remove excess 'ALL', remove_excess_margin 0
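To pin down exactly when the removal request changed, the log lines above can be scanned for changes in the (remove excess, margin) pair. This is only a minimal sketch: the regular expression is written against the line format shown in this excerpt, and the function name removal_transitions is made up for illustration, not part of GlideinWMS.

```python
import re

# Pattern assumed from the factory log lines quoted in this issue.
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] INFO: Client (?P<client>\S+) .*"
    r"remove excess '(?P<kind>\w+)', remove_excess_margin (?P<margin>\d+)"
)


def removal_transitions(lines):
    """Return (timestamp, old, new) for every change of (type, margin).

    Hypothetical helper: lines that do not match the pattern are skipped.
    """
    transitions = []
    previous = None
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue
        current = (match["kind"], int(match["margin"]))
        if previous is not None and current != previous:
            transitions.append((match["ts"], previous, current))
        previous = current
    return transitions
```

Fed with the output of the grep command above (for example via sys.stdin), it would report the single switch from ('IDLE', 5) to ('ALL', 0) seen in this excerpt.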
The frontend is configured with <glideins_removal margin="5" requests_tracking="True" type="IDLE" wait="0"/>.
Due to the upgrade and GSI being disabled, the frontend relied on the idtoken to talk to the Collector and the Schedd. However, due to a puppet misconfiguration, the token was the wrong one and the frontend could not talk to them.
Is it possible that this code was executed?
glideinwms/frontend/glideinFrontendElement.py
Lines 1809 to 1811 in 2fe2468
Maybe the Python bindings did not raise an exception in this case and just returned empty dictionaries?
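To illustrate why that would be dangerous, here is a small self-contained sketch of the suspected failure mode. It does not reproduce the real frontend code or the HTCondor bindings; every function name here is hypothetical. It only shows that once a failed query is turned into an empty dictionary, the caller can no longer tell "authentication failed" apart from "no jobs in the queue".

```python
def query_swallowing_errors(query_fn):
    """Hypothetical wrapper that hides query failures behind an empty dict."""
    try:
        return query_fn()
    except Exception:
        return {}


def count_idle_jobs(jobs):
    """Count job ads with JobStatus == 1 (idle in HTCondor)."""
    return sum(1 for ad in jobs.values() if ad.get("JobStatus") == 1)


def failing_query():
    # Stand-in for a schedd query rejected because of a wrong token.
    raise PermissionError("authentication failed: wrong token")


def empty_queue_query():
    # Stand-in for a successful query against a queue with no jobs.
    return {}


idle_when_broken = count_idle_jobs(query_swallowing_errors(failing_query))
idle_when_empty = count_idle_jobs(query_swallowing_errors(empty_queue_query))
```

Both values come out as zero, so any logic that decides on pilot removal from these counts would treat an unreachable schedd exactly like a pool with no demand.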
To Reproduce
I'd start by checking whether the frontend asks for "ALL" removal when the idtoken is replaced with a wrong one.
Expected behavior
If the frontend cannot query the collector and the schedds, it should not send any requests to the factory at all. Maybe the code should raise an exception and even exit?
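A rough sketch of the kind of guard I have in mind is below. It is an assumption about how this could be structured, not the existing frontend code: QueryFailedError, query_or_raise and build_request are invented names, and the two query callables stand in for the real schedd and collector queries.

```python
class QueryFailedError(RuntimeError):
    """Raised when a collector or schedd query could not be completed."""


def query_or_raise(name, query_fn):
    """Run query_fn; turn any failure or None result into QueryFailedError.

    An empty dict is still accepted as a valid answer (nothing to report).
    """
    try:
        result = query_fn()
    except Exception as err:
        raise QueryFailedError(f"{name} query failed: {err}") from err
    if result is None:
        raise QueryFailedError(f"{name} query returned no result")
    return result


def build_request(query_jobs, query_glideins):
    """Return summary counts for a request, or None to skip this cycle."""
    try:
        jobs = query_or_raise("schedd", query_jobs)
        glideins = query_or_raise("collector", query_glideins)
    except QueryFailedError:
        # Do not advertise anything to the factory on incomplete data.
        return None
    idle_jobs = sum(1 for ad in jobs.values() if ad.get("JobStatus") == 1)
    return {"idle_jobs": idle_jobs, "known_glideins": len(glideins)}
```

With this shape, a wrong token would make build_request return None and the cycle would be skipped, while a genuinely empty queue would still produce a normal request with zero counts.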
Info (please complete the following information):
- Priority: high
- Stakeholders: CMS
- Components: frontend