Factory monitor not updating after factory lockup #324

@mmascher

Describe the bug
The UCSD factory machine locked up for an unknown reason (possibly a cooling issue in the room). Once the machine recovered, the monitor was not available. It turned out that some monitor cache files were empty, and the factory was not expecting that.

To Reproduce
Run the factory for a while, then make one of the .ftstpk cache files empty, for example:
/var/log/gwms-factory/server/entry_CMSHTPC_T2_US_Purdue_Negishi_Op/condor_activity_20230729_UCSD-CMS-Frontend.main.log.fecmsucsd.ftstpk
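
The failure can also be reproduced outside the factory. The following is a minimal sketch using only the Python standard library; the file name `example.ftstpk` and the helpers `make_empty` and `try_load` are hypothetical and stand in for a real cache file. It shows that unpickling a zero-length file raises `EOFError`, the first exception in the traceback under "Additional context".

```python
import os
import pickle
import tempfile


def make_empty(path):
    """Truncate the file at `path` to zero bytes, simulating the state after the lockup."""
    with open(path, "wb"):
        pass


def try_load(path):
    """Return (data, error); error is None when unpickling succeeds."""
    try:
        with open(path, "rb") as fo:
            return pickle.load(fo), None
    except EOFError as err:
        return None, err


# Stand-in cache file in a temporary directory (hypothetical name, not a real factory path).
tmp_dir = tempfile.mkdtemp()
cache_path = os.path.join(tmp_dir, "example.ftstpk")
with open(cache_path, "wb") as fo:
    pickle.dump({"jobs": 3}, fo)

before = try_load(cache_path)  # loads normally
make_empty(cache_path)
after = try_load(cache_path)   # fails with EOFError on the empty file
```

On a real factory the same effect can be obtained by truncating one of the existing .ftstpk files while the factory is running.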

Expected behavior
The factory should handle empty cache files gracefully, and the monitor should remain available.

Info (please complete the following information):

  • Priority: low
  • Stakeholders: FactoryOps
  • Components: factory monitoring

Additional context

...
[2023-07-29 23:37:07,142] DEBUG: glideFactoryEntry:1058: Checking security credentials for client UCSD-CMS-Frontend.main
[2023-07-29 23:37:07,218] ERROR: glideFactoryEntry:1819: Could not read /var/log/gwms-factory/server/entry_CMSHTPC_T2_US_Purdue_Negishi_Op/condor_activity_20230729_UCSD-CMS-Frontend.main.log.fecmsucsd.ftstpk
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/glideinwms/lib/condorLogParser.py", line 1386, in loadCache
    data = util.file_pickle_load(fname)
  File "/usr/lib/python3.6/site-packages/glideinwms/lib/util.py", line 306, in file_pickle_load
    conditional_raise(mask_exceptions)
  File "/usr/lib/python3.6/site-packages/glideinwms/lib/util.py", line 295, in file_pickle_load
    data = pickle.load(fo)
EOFError: Ran out of input

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/glideinwms/factory/glideFactoryEntry.py", line 1817, in perform_work_v3
    log_stats[credential_username + ":" + client_int_name].load()
  File "/usr/lib/python3.6/site-packages/glideinwms/lib/condorLogParser.py", line 671, in load
    obj.load()
  File "/usr/lib/python3.6/site-packages/glideinwms/lib/condorLogParser.py", line 82, in load
    return self.loadCache()
  File "/usr/lib/python3.6/site-packages/glideinwms/lib/condorLogParser.py", line 104, in loadCache
    self.data = loadCache(self.cachename)
  File "/usr/lib/python3.6/site-packages/glideinwms/lib/condorLogParser.py", line 1388, in loadCache
    raise RuntimeError("Could not read %s" % fname)
RuntimeError: Could not read /var/log/gwms-factory/server/entry_CMSHTPC_T2_US_Purdue_Negishi_Op/condor_activity_20230729_UCSD-CMS-Frontend.main.log.fecmsucsd.ftstpk
[2023-07-29 23:38:34,834] DEBUG: glideFactoryEntry:1058: Checking security credentials for client UCSD-CMS-Frontend.main
...
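
One possible way to handle this corner case is to treat an empty or unreadable cache file as a cache miss instead of raising. The sketch below illustrates the idea with the standard library only. `load_cache_or_none` is a hypothetical helper, not part of glideinwms, and it is an assumption that the caller could rebuild the cache from the original log when it gets `None` back.

```python
import os
import pickle
import tempfile


def load_cache_or_none(fname):
    """Load a pickled cache file; return None if it is missing, empty, or corrupt.

    Hypothetical helper for illustration only, not the glideinwms implementation.
    """
    try:
        if os.path.getsize(fname) == 0:
            return None  # zero-length file, e.g. left behind by an unclean shutdown
        with open(fname, "rb") as fo:
            return pickle.load(fo)
    except (OSError, EOFError, pickle.UnpicklingError):
        # Missing file, truncated pickle, or invalid pickle data: treat as a cache miss.
        return None


# Small demonstration with stand-in files in a temporary directory.
demo_dir = tempfile.mkdtemp()
empty_file = os.path.join(demo_dir, "empty.ftstpk")
valid_file = os.path.join(demo_dir, "valid.ftstpk")
missing_file = os.path.join(demo_dir, "missing.ftstpk")

open(empty_file, "wb").close()
with open(valid_file, "wb") as fo:
    pickle.dump({"status": "ok"}, fo)

empty_result = load_cache_or_none(empty_file)
valid_result = load_cache_or_none(valid_file)
missing_result = load_cache_or_none(missing_file)
```

Returning `None` keeps the decision with the caller: it can log a warning and regenerate the cache rather than aborting the whole monitoring update for the entry.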

Labels

BUG (for bugs), Low (low priority), factory-mon (affected component), factoryops (Factory Operations stakeholder)
