Entire pool drained because frontend requested removal of running pilots

**Describe the bug**
During the upgrade to HTCondor 10, one of the CMS frontends (the one exclusively serving the CERN pool) asked the factory to remove pilots, and essentially the whole pool was drained.

Here is the `glideclient` classad:

```
Name = "A2PYTHBM_CMSHTPC_T2_CH_CERN_ce503@gfactory_instance@CERN-Prod@CMS_T0-Frontend.local_users"
ReqEncIdentity = "1328afb1bb4af2ca7cee421201a7048d39979808963e4a5cb6397fa37f726187"
ReqEncKeyCode = "b50f881fad085e7d60897dd6475fed04da2ad9a87b515b4bdfbada74ab206dcad2892e286a31fb4490d747697bc5cff2f80499804c192424a065e522f4e242f4535b0484e0eb88b4995573b6910ce72da77db131b3731f07d2394d06e5dd92b0fa0927bfec452e6f321fb5f0146348ab7bf7b159b9e781123e0bea3314f0259b59c939398763b725363e8347683e37307e5997439cfb6d41753748798aecb599fa45905385e77e576c9edee6ec2df4aec2454f697d597e5da7196e938e78661fedd2cb0a0ddfa1afc80c361a6e99dae04c37d0530d7403dbb6cf6f53f1b41171f6282e4299c4ef98e935fcd10a7bf98b95f582be4effe0acd21093eb367658b4"
ReqGlidein = "CMSHTPC_T2_CH_CERN_ce503@gfactory_instance@CERN-Prod"
ReqIdleGlideins = 1
ReqIdleLifetime = "82800"
ReqMaxGlideins = 6
ReqName = "CMSHTPC_T2_CH_CERN_ce503@gfactory_instance@CERN-Prod"
ReqPubKeyID = "56f865cdbf4628ec88d04b792c7daed7"
ReqRemoveExcess = "ALL"
ReqRemoveExcessMargin = 0
UpdateSequenceNumber = 0
UpdatesHistory = "00000000000000000000000000000000"
UpdatesLost = 0
UpdatesSequenced = 0
UpdatesTotal = 22682
WebDescriptFile = "description.n3u71g.cfg"
WebDescriptSign = "d041d24fff115c12e014a9e333aeb67214101183"
WebGroupDescriptFile = "description.n3u71g.cfg"
WebGroupDescriptSign = "dfd6ef2137b3a544abd1fffe38230a3ac3d1b72b"
WebGroupURL = "http://vocms0819.cern.ch/vofrontend/stage/group_local_users"
WebMonitoringURL = "http://vocms0819.cern.ch/vofrontend/monitor"
WebSignType = "sha1"
WebURL = "http://vocms0819.cern.ch/vofrontend/stage"
```

And here are the factory logs:
```
[mmascher@vocms0207 ~]$ grep "Client CMS_T0-Frontend.CERN_condor\ " /var/log/gwms-factory/server/entry_CMSHTPC_T2_CH_CERN_ce513/CMSHTPC_T2_CH_CERN_ce513.info.log* -h | grep remove\ excess
...
[2023-03-29 18:33:24,597] INFO: Client CMS_T0-Frontend.CERN_condor (secid: CMS_T0-Frontend_cmspilot) requesting 0 glideins, max running 0, idle lifetime 82800, remove excess 'IDLE', remove_excess_margin 5
[2023-03-29 18:34:23,535] INFO: Client CMS_T0-Frontend.CERN_condor (secid: CMS_T0-Frontend_cmspilot) requesting 0 glideins, max running 0, idle lifetime 82800, remove excess 'IDLE', remove_excess_margin 5
[2023-03-29 18:35:23,421] INFO: Client CMS_T0-Frontend.CERN_condor (secid: CMS_T0-Frontend_cmspilot) requesting 0 glideins, max running 0, idle lifetime 82800, remove excess 'IDLE', remove_excess_margin 5
[2023-03-29 18:36:25,936] INFO: Client CMS_T0-Frontend.CERN_condor (secid: CMS_T0-Frontend_cmspilot) requesting 0 glideins, max running 0, idle lifetime 82800, remove excess 'ALL', remove_excess_margin 0
[2023-03-29 18:37:43,116] INFO: Client CMS_T0-Frontend.CERN_condor (secid: CMS_T0-Frontend_cmspilot) requesting 0 glideins, max running 0, idle lifetime 82800, remove excess 'ALL', remove_excess_margin 0
[2023-03-29 18:39:48,317] INFO: Client CMS_T0-Frontend.CERN_condor (secid: CMS_T0-Frontend_cmspilot) requesting 0 glideins, max running 0, idle lifetime 82800, remove excess 'ALL', remove_excess_margin 0
[2023-03-29 18:40:56,850] INFO: Client CMS_T0-Frontend.CERN_condor (secid: CMS_T0-Frontend_cmspilot) requesting 0 glideins, max running 0, idle lifetime 82800, remove excess 'ALL', remove_excess_margin 0
```

The frontend is configured with `<glideins_removal margin="5" requests_tracking="True" type="IDLE" wait="0"/>`.

Because of the upgrade, GSI was disabled and the frontend relied on the idtoken to talk to the Collector and the Schedd. However, due to a Puppet misconfiguration, the token was the wrong one and the frontend could not talk to either of them.

Is it possible that this code was executed?

https://github.com/glideinWMS/glideinwms/blob/2fe2468e5d2da0ec962433c4ab92185052281b31/frontend/glideinFrontendElement.py#L1809-L1811

Maybe the Python bindings did not raise an exception in this case and just returned empty dictionaries?
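
To illustrate the suspected failure mode, here is a minimal sketch. It is not the actual GlideinWMS logic: `choose_removal` and its parameters are hypothetical names, and the only behavior taken from this report is the switch from `'IDLE'` with margin 5 to `'ALL'` with margin 0. If failed queries come back as empty dictionaries, they look exactly like a genuinely empty pool:

```python
def choose_removal(configured_type, configured_margin, idle_jobs, glideins):
    """Hypothetical, simplified model of the removal decision.

    idle_jobs and glideins are dicts, as a pool query might return them.
    """
    if not idle_jobs and not glideins:
        # Nothing requested and nothing registered: drop everything.
        return "ALL", 0
    return configured_type, configured_margin


# Healthy query: the configured policy is kept.
print(choose_removal("IDLE", 5, {"job1": {}}, {"glidein1": {}}))
# Query that silently failed and returned empty dicts: looks like an empty pool.
print(choose_removal("IDLE", 5, {}, {}))
```

In this model the second call cannot tell "the pool is empty" apart from "the pool could not be queried", which would match the logs above.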

**To Reproduce**
I'd start by checking if the frontend is asking for "ALL" removal when you change the idtoken.

**Expected behavior**
If the frontend cannot query the collector and the schedds, it should not send any requests to the factory at all. Maybe the code should raise an exception and even exit?
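
A minimal sketch of the kind of guard I have in mind, assuming the queries can be made to fail loudly. All names here (`PoolQueryError`, `checked_query`, `run_iteration` and the callables passed in) are hypothetical and not part of GlideinWMS:

```python
class PoolQueryError(RuntimeError):
    """Raised when the collector or a schedd could not be queried."""


def checked_query(name, query_fn):
    """Run query_fn and turn any failure (or a None result) into PoolQueryError."""
    try:
        result = query_fn()
    except Exception as err:
        raise PoolQueryError(f"query of {name} failed: {err}") from err
    if result is None:
        raise PoolQueryError(f"query of {name} returned no result")
    return result


def run_iteration(query_collector, query_schedds, send_requests):
    """Send factory requests only if both pool queries succeeded.

    Returns True if requests were sent, False if the iteration was skipped.
    """
    try:
        glideins = checked_query("collector", query_collector)
        jobs = checked_query("schedds", query_schedds)
    except PoolQueryError as err:
        print(f"Skipping factory requests: {err}")
        return False
    send_requests(jobs, glideins)
    return True
```

This only helps if a failed query actually raises or returns `None`. If the bindings silently return empty dictionaries on an authentication failure, a separate connectivity or authentication check would be needed before trusting an empty result.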

**Info (please complete the following information):**
-   Priority: high
-   Stakeholders: CMS
-   Components: frontend


Code at the permalink above (`glideinFrontendElement.py`, lines 1809-1811):

	# no requests and no glidein registered
	# no harm getting rid of everything
	remove_excess_running = True
