JobSubmitter crashing due to issues with cmssdt.cern.ch #12419

Description

@hassan11196

Impact of the bug
WMAgent, JobSubmitter Component

Describe the bug
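JobSubmitter terminates when it cannot reach cmssdt.cern.ch to fetch the MicroArch info. The connection error raised during job submission is left unhandled, so the worker thread shuts down: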

2025-07-31 14:42:14,628:139846332970688:INFO:JobSubmitterPoller:Done assigning site locations.
2025-07-31 14:44:26,862:139846332970688:ERROR:BossAirAPI:Unhandled exception while submitting jobs to plugin: SimpleCondorPlugin
(28, 'Failed to connect to cmssdt.cern.ch port 443 after 131861 ms: Could not connect to server')
2025-07-31 14:44:26,893:139846332970688:ERROR:BaseWorkerThread:Error in worker algorithm (1):
Backtrace:
  <WMComponent.JobSubmitter.JobSubmitterPoller.JobSubmitterPoller object at 0x7f309a618e00> <@========== WMException Start ==========@>
Exception Class: BossAirException
Message: Unhandled exception while submitting jobs to plugin: SimpleCondorPlugin
(28, 'Failed to connect to cmssdt.cern.ch port 443 after 131861 ms: Could not connect to server')
        ClassName : None
        ModuleName : WMCore.BossAir.BossAirAPI
        MethodName : submit
        ClassInstance : None
        FileName : /usr/local/lib/python3.12/site-packages/WMCore/BossAir/BossAirAPI.py
        LineNumber : 396
        ErrorNr : 0

Traceback: 
  File "/usr/local/lib/python3.12/site-packages/WMCore/BossAir/BossAirAPI.py", line 382, in submit
    localSuccess, localFailure = pluginInst.submit(jobs=jobsToSubmit,
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/WMCore/BossAir/Plugins/SimpleCondorPlugin.py", line 166, in submit
    (sub, jobParams) = self.createSubmitRequest(jobsReady)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/WMCore/BossAir/Plugins/SimpleCondorPlugin.py", line 701, in createSubmitRequest
    jobParameters = self.getJobParameters(jobList)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/WMCore/BossAir/Plugins/SimpleCondorPlugin.py", line 507, in getJobParameters
    rel_microarchs = self.tc.defaultMicroArchVersionNumberByRelease()
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/WMCore/Services/TagCollector/TagCollector.py", line 125, in defaultMicroArchVersionNumberByRelease
    for row in self.data():
               ^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/WMCore/Services/TagCollector/TagCollector.py", line 81, in data
    data = self._getResult()
           ^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/WMCore/Services/TagCollector/TagCollector.py", line 66, in _getResult
    f = self.refreshCache(cFile, callname, args, encoder=encoder, decoder=decodeBytesToUnicode,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/WMCore/Services/Service.py", line 218, in refreshCache
    self.getData(cachefile, url, inputdata, incoming_headers, encoder, decoder, verb, contentType, binary=binary)

  File "/usr/local/lib/python3.12/site-packages/WMCore/Services/Service.py", line 318, in getData
    data, dummyStatus, dummyReason, from_cache = self["requests"].makeRequest(uri=url,
                                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/WMCore/Services/Requests.py", line 185, in makeRequest
    result, response = self.makeRequest_pycurl(uri, data, verb, headers)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/WMCore/Services/Requests.py", line 202, in makeRequest_pycurl
    response, result = self.reqmgr.request(uri, data, headers, verb=verb,
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/Utils/PortForward.py", line 68, in portMangle
    return callFunc(callObj, url, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/WMCore/Services/pycurl_manager.py", line 338, in request
    curl.perform()

<@---------- WMException End ----------@>  File "/usr/local/lib/python3.12/site-packages/WMCore/WorkerThreads/BaseWorkerThread.py", line 183, in __call__
    tSpent, results, _ = algorithmWithDBExceptionHandler(parameters)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/WMCore/Database/DBExceptionHandler.py", line 43, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/Utils/Timers.py", line 57, in wrapper
    res = func(*arg, **kw)
          ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/WMComponent/JobSubmitter/JobSubmitterPoller.py", line 830, in algorithm
    self.submitJobs(jobsToSubmit=jobsToSubmit)
  File "/usr/local/lib/python3.12/site-packages/WMComponent/JobSubmitter/JobSubmitterPoller.py", line 757, in submitJobs
    successList, failList = self.bossAir.submit(jobs=jobList)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/WMCore/BossAir/BossAirAPI.py", line 396, in submit
    raise BossAirException(msg)

2025-07-31 14:44:26,894:139846332970688:INFO:Harness:>>>Terminating worker threads
2025-07-31 14:44:26,895:139846332970688:ERROR:BaseWorkerThread:Error in event loop (2): <WMComponent.JobSubmitter.JobSubmitterPoller.JobSubmitterPoller object at 0x7f309a618e00> <@========== WMException Start ==========@>
Exception Class: BossAirException
Message: Unhandled exception while submitting jobs to plugin: SimpleCondorPlugin
(28, 'Failed to connect to cmssdt.cern.ch port 443 after 131861 ms: Could not connect to server')
        ClassName : None
        ModuleName : WMCore.BossAir.BossAirAPI
        MethodName : submit
        ClassInstance : None
        FileName : /usr/local/lib/python3.12/site-packages/WMCore/BossAir/BossAirAPI.py
        LineNumber : 396
        ErrorNr : 0

Traceback: 
  File "/usr/local/lib/python3.12/site-packages/WMCore/BossAir/BossAirAPI.py", line 382, in submit
    localSuccess, localFailure = pluginInst.submit(jobs=jobsToSubmit,
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/WMCore/BossAir/Plugins/SimpleCondorPlugin.py", line 166, in submit
    (sub, jobParams) = self.createSubmitRequest(jobsReady)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/WMCore/BossAir/Plugins/SimpleCondorPlugin.py", line 701, in createSubmitRequest
    jobParameters = self.getJobParameters(jobList)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/WMCore/BossAir/Plugins/SimpleCondorPlugin.py", line 507, in getJobParameters
    rel_microarchs = self.tc.defaultMicroArchVersionNumberByRelease()
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/WMCore/Services/TagCollector/TagCollector.py", line 125, in defaultMicroArchVersionNumberByRelease
    for row in self.data():
               ^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/WMCore/Services/TagCollector/TagCollector.py", line 81, in data
    data = self._getResult()
           ^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/WMCore/Services/TagCollector/TagCollector.py", line 66, in _getResult
    f = self.refreshCache(cFile, callname, args, encoder=encoder, decoder=decodeBytesToUnicode,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/WMCore/Services/Service.py", line 218, in refreshCache
    self.getData(cachefile, url, inputdata, incoming_headers, encoder, decoder, verb, contentType, binary=binary)

  File "/usr/local/lib/python3.12/site-packages/WMCore/Services/Service.py", line 318, in getData
    data, dummyStatus, dummyReason, from_cache = self["requests"].makeRequest(uri=url,
                                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/WMCore/Services/Requests.py", line 185, in makeRequest
    result, response = self.makeRequest_pycurl(uri, data, verb, headers)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/WMCore/Services/Requests.py", line 202, in makeRequest_pycurl
    response, result = self.reqmgr.request(uri, data, headers, verb=verb,
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/Utils/PortForward.py", line 68, in portMangle
    return callFunc(callObj, url, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/WMCore/Services/pycurl_manager.py", line 338, in request
    curl.perform()

<@---------- WMException End ----------@>
Backtrace:
  File "/usr/local/lib/python3.12/site-packages/WMCore/WorkerThreads/BaseWorkerThread.py", line 209, in __call__
    raise ex
  File "/usr/local/lib/python3.12/site-packages/WMCore/WorkerThreads/BaseWorkerThread.py", line 183, in __call__
    tSpent, results, _ = algorithmWithDBExceptionHandler(parameters)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/WMCore/Database/DBExceptionHandler.py", line 43, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/Utils/Timers.py", line 57, in wrapper
    res = func(*arg, **kw)
          ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/WMComponent/JobSubmitter/JobSubmitterPoller.py", line 830, in algorithm
    self.submitJobs(jobsToSubmit=jobsToSubmit)
  File "/usr/local/lib/python3.12/site-packages/WMComponent/JobSubmitter/JobSubmitterPoller.py", line 757, in submitJobs
    successList, failList = self.bossAir.submit(jobs=jobList)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/WMCore/BossAir/BossAirAPI.py", line 396, in submit
    raise BossAirException(msg)

2025-07-31 14:44:26,925:139846332970688:INFO:BaseWorkerThread:Worker thread <WMComponent.JobSubmitter.JobSubmitterPoller.JobSubmitterPoller object at 0x7f309a618e00> terminated
^C

How to reproduce it
Steps to reproduce the behavior:

  1. Make cmssdt.cern.ch unreachable from the agent.
  2. Restart JobSubmitter and observe that it terminates instead of submitting jobs.

Expected behavior
JobSubmitter should fail the submission and try again on a later cycle instead of terminating.
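
A minimal sketch of that behavior, not the actual WMCore code: a failed MicroArch refresh is logged, the last successfully fetched mapping is reused if there is one, and otherwise the cycle is skipped so the next poll can retry. `FlakyTagCollector`, `MicroArchCache` and `pollCycle` are hypothetical names introduced only for illustration, `ConnectionError` stands in for the pycurl error seen in the log, and the release name and value are placeholders.

```python
import logging

logger = logging.getLogger("JobSubmitterSketch")


class FlakyTagCollector:
    """Hypothetical stand-in for the TagCollector service: fails a fixed number of times."""

    def __init__(self, failures):
        self.failures = failures

    def defaultMicroArchVersionNumberByRelease(self):
        if self.failures > 0:
            self.failures -= 1
            # Stand-in for the connection failure reported in the log above
            raise ConnectionError("Failed to connect to cmssdt.cern.ch port 443")
        return {"CMSSW_X_Y_Z": 3}  # placeholder release name and value


class MicroArchCache:
    """Keeps the last successfully fetched mapping so a failed refresh is not fatal."""

    def __init__(self, tagCollector):
        self.tc = tagCollector
        self.lastGood = None

    def get(self):
        try:
            self.lastGood = self.tc.defaultMicroArchVersionNumberByRelease()
        except ConnectionError as ex:
            # Assumption: the real component would need to catch the pycurl error here
            logger.warning("Could not refresh MicroArch info: %s", ex)
        return self.lastGood


def pollCycle(cache, jobs):
    """One submission cycle: returns the submitted jobs, or [] if the cycle is skipped."""
    microArchs = cache.get()
    if microArchs is None:
        logger.error("No MicroArch info available; skipping this cycle, will retry.")
        return []
    # Job submission using microArchs would happen here
    return list(jobs)


if __name__ == "__main__":
    cache = MicroArchCache(FlakyTagCollector(failures=2))
    for cycle in range(3):
        print(cycle, pollCycle(cache, ["job1"]))
```

With two simulated failures, the first two cycles submit nothing and the third submits the job. Whether reusing a stale mapping is acceptable, or whether the cycle should always be skipped on failure, is a design decision for the maintainers.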

Additional context and error message
None
