design doc ‐ job table

Dario Mapelli edited this page May 7, 2024 · 1 revision

Design document about the introduction of a new table in the Oracle DB for the bookkeeping of the jobs.

goals

  • lower the barrier to adding new features
  • I want to assume that the latest and most complete set of information about a job is in the DB, not in a cached file on a scheduler.

design

  • all the job bookkeeping should be centralized in a single table, so that every part of CRAB that needs to know the status of a job can query the Oracle DB
  • after jobsplitting, TW creates a row for every job (retry 0).
  • dagman, prejob, postjob only touch the DB. no need for status_cache, runs_and_lumis.tar.gz, ...
  • crab status, crab report, ... should only read from the DB, not from status_cache, ...

option 1: one row for every CRAB job, with information about the last retry only (for example, if the retry count column holds retrycount=4, the scheduler will have job_log.(0|1|2|3|4).txt). table primary key: (taskname, crab job id).

  • pro: fewer rows.
  • con: we lose the history of earlier retries, since only the last retry is stored

option 2: one row for every job on the scheduler, i.e. one row per retry of each CRAB job. table primary key: (taskname, crab job id, retry count)

  • pro: better history tracking (site where the job ran, ...)
  • con: need to check more than one row to look for the latest retry of a job.

Dario prefers option 2.
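
A minimal sketch of option 2, to show what the "more than one row" con looks like in practice. It uses an in-memory SQLite database as a stand-in for the Oracle DB. The table name `jobs`, the columns `job_id`, `retry`, `state` and `site`, and the sample site names are all assumptions made for illustration, not the actual schema.

```python
import sqlite3

# In-memory SQLite is only a stand-in for the Oracle DB.
conn = sqlite3.connect(":memory:")

# Hypothetical table for option 2: one row per (task, job, retry).
conn.execute("""
    CREATE TABLE jobs (
        taskname TEXT    NOT NULL,
        job_id   INTEGER NOT NULL,
        retry    INTEGER NOT NULL,
        state    TEXT,
        site     TEXT,
        PRIMARY KEY (taskname, job_id, retry)
    )
""")


def latest_retries(conn, taskname):
    """Return (job_id, retry, state, site) for the latest retry of each job."""
    return conn.execute(
        """
        SELECT j.job_id, j.retry, j.state, j.site
        FROM jobs j
        WHERE j.taskname = ?
          AND j.retry = (SELECT MAX(r.retry)
                         FROM jobs r
                         WHERE r.taskname = j.taskname
                           AND r.job_id = j.job_id)
        ORDER BY j.job_id
        """,
        (taskname,),
    ).fetchall()


# Job 1 failed once and was retried; job 2 is still on its first attempt.
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?, ?, ?)",
    [
        ("example_task", 1, 0, "failed", "T2_A"),
        ("example_task", 1, 1, "finished", "T2_B"),
        ("example_task", 2, 0, "running", "T2_A"),
    ],
)
latest = latest_retries(conn, "example_task")
```

With option 1 the subquery would not be needed, because the primary key (taskname, crab job id) already guarantees one row per job. With option 2 the row for retry 0 of job 1 stays in the table, which is where the better history tracking comes from.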

implementation

Start adding information to the new table

  1. add the new table in the DB
  2. make the TW add one row per job after jobsplitting
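
Step 2 could look roughly like the sketch below. It again uses SQLite as a stand-in for the Oracle DB. The function name `add_initial_job_rows`, the table layout, and the initial state `"idle"` are assumptions for illustration and are not taken from the TW code.

```python
import sqlite3


def add_initial_job_rows(conn, taskname, job_ids):
    """Insert one row per job with retry 0 and return how many were added."""
    rows = [(taskname, job_id, 0, "idle") for job_id in job_ids]
    with conn:  # commit all rows together, roll back if any insert fails
        conn.executemany(
            "INSERT INTO jobs (taskname, job_id, retry, state) "
            "VALUES (?, ?, ?, ?)",
            rows,
        )
    return len(rows)


# Stand-in database with a hypothetical, reduced job table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs ("
    " taskname TEXT NOT NULL, job_id INTEGER NOT NULL,"
    " retry INTEGER NOT NULL, state TEXT,"
    " PRIMARY KEY (taskname, job_id, retry))"
)

# Pretend that jobsplitting produced three jobs for this task.
added = add_initial_job_rows(conn, "example_task", [1, 2, 3])
```

Inserting all rows in one transaction means a task never ends up with only part of its jobs registered in the table.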

identify all the places where bookkeeping is done by editing the following files

  • status_cache.pkl
  • runs_and_lumis.tar.gz
  • ...

make sure that bookkeeping is done both with these files and by updating the job row in the DB, for example in:

  • dagman
  • postjob
  • ...
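
The "write to both places" phase can be sketched as follows. This is only an illustration under stated assumptions: the pickle file here holds a plain dict and does not reproduce the real layout of status_cache.pkl, the function `record_job_state` and the table layout are hypothetical, and SQLite stands in for the Oracle DB.

```python
import os
import pickle
import sqlite3
import tempfile


def record_job_state(conn, cache_path, taskname, job_id, retry, state):
    """Record a job state in the legacy file and in the DB row."""
    # Legacy path: update the file-based cache (hypothetical dict layout).
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as fh:
            cache = pickle.load(fh)
    cache[job_id] = {"retry": retry, "state": state}
    with open(cache_path, "wb") as fh:
        pickle.dump(cache, fh)

    # New path: insert or overwrite the row for this (task, job, retry).
    # INSERT OR REPLACE is SQLite syntax; Oracle would need e.g. MERGE.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO jobs (taskname, job_id, retry, state) "
            "VALUES (?, ?, ?, ?)",
            (taskname, job_id, retry, state),
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs ("
    " taskname TEXT NOT NULL, job_id INTEGER NOT NULL,"
    " retry INTEGER NOT NULL, state TEXT,"
    " PRIMARY KEY (taskname, job_id, retry))"
)

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "example_cache.pkl")
    record_job_state(conn, cache_path, "example_task", 1, 0, "running")
    record_job_state(conn, cache_path, "example_task", 1, 0, "finished")
    with open(cache_path, "rb") as fh:
        from_file = pickle.load(fh)

from_db = conn.execute("SELECT job_id, retry, state FROM jobs").fetchall()
```

While both paths are active, comparing `from_file` with `from_db` is one way to gain confidence that the DB holds the same information before any reader is switched over.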

Then, adapt existing code to use the new source of information, instead of status_cache.pkl, ...

  • crab status should use information in the DB. Dario would like the summer student to get at least to this point.
  • crab report should use information from the DB
  • crab recovery should use information from the DB

when we are confident that no portion of crab is using the “old” files for bookkeeping:

  • remove the code that consumes the bookkeeping from files (crab status, ...)
  • remove the code that produces the bookkeeping into the files (dagman, postjob, ...): no more status_cache, for example

may never do:

  • remove runs_and_lumis.tar.gz: there may be users who rely on it! we may want to educate them, but we cannot drop it altogether