# CM-Service Architecture

```{abstract}
Overview of the role of the campaign management tool cm-service.
```

## CM-Service

The Campaign Management Team has been developing an orchestration tool called
cm-service to support running large processing campaigns. This system is
designed to automate as much of the team’s work as possible, enabling more
efficient processing and a faster time to completion for campaigns.

Rubin campaigns are organized as a “workflow of workflows”: an overall data
release production campaign is broken down into a series of sequential phases
known as “steps”, and within each step a set of batch submissions is executed
on subsets of the data until all data have completed that step. These
individual batch submissions are themselves directed acyclic graph workflows
(“quantum graphs”), but the dependencies inside each graph are managed by the
execution framework rather than by campaign management’s tooling.

```{figure} steps_groups.png

Overview of the structure of a campaign in Rubin processing.
```
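
The step/group decomposition above can be sketched as a small data model. This is an illustrative sketch only, not cm-service's actual database schema; the class and field names (`Campaign`, `Step`, `Group`, `data_query`) are assumptions made here for clarity.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """One batch submission covering a subset of the step's data."""
    data_query: str          # e.g. a subset of tracts or visits (illustrative)
    status: str = "waiting"  # waiting -> submitted -> done


@dataclass
class Step:
    """A sequential phase of the campaign; runs after its predecessor."""
    name: str
    groups: list[Group] = field(default_factory=list)

    @property
    def done(self) -> bool:
        return all(g.status == "done" for g in self.groups)


@dataclass
class Campaign:
    """An ordered list of steps: the "workflow of workflows"."""
    name: str
    steps: list[Step] = field(default_factory=list)

    def next_runnable_step(self) -> "Step | None":
        """First step not yet finished; earlier steps must complete first."""
        for step in self.steps:
            if not step.done:
                return step
        return None
```

With this model, a campaign whose first step has finished all of its groups would report the second step as the next one eligible for submission.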

## Goals for CM-Service

1. Manage the submission of workflows to the batch systems, automatically
submitting downstream steps after precursor steps have finished.

2. Decompose large steps into a set of smaller batch submissions (“groups”),
which can be submitted at a managed rate over many days or weeks.

3. For multi-site campaigns, coordinate which workflow submissions go to which
sites, and trigger automatic batch submissions to those sites. Trigger data
transfers as necessary.

4. Provide visibility into the status of campaign processing, reporting which
campaigns are in progress and the status of their steps and groups.

An implementation goal is to avoid cross-facility dependencies: each site will
have downtime, but the other sites must be able to continue processing during
that downtime.
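
Goal 2, submitting a step's groups at a managed rate, can be sketched as follows. The pacing policy shown (a fixed batch size per cycle) is an assumption for illustration, not cm-service's actual algorithm, and `paced_submissions` is a hypothetical name.

```python
def paced_submissions(groups: list[str], max_per_cycle: int):
    """Yield batches of group names, at most max_per_cycle per cycle.

    In the real service a "cycle" would correspond to a daemon polling
    interval spread over days or weeks; here it is just one iteration.
    """
    for i in range(0, len(groups), max_per_cycle):
        yield groups[i : i + max_per_cycle]


# Example: seven groups submitted at most three at a time -> three cycles.
batches = list(paced_submissions([f"group{n}" for n in range(7)], 3))
```

A production policy would likely also consult current batch-system load before releasing the next batch, rather than using a fixed size.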

## Architecture

The cm-service consists of four components:
- An API service, which users can interact with via a command line interface to
configure and manage campaigns.
- A back-end database to store campaign configuration and state.
- A web front-end, initially for monitoring campaign state but which could
eventually support the same capabilities as the command line interface.
- A daemon service that carries out the actual processing steps, based on
information in the campaign database.
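
The daemon's relationship to the database can be sketched as a polling loop: read state, act on actionable records, write the new state back. The table layout and status names below are illustrative placeholders, not the actual cm-service schema.

```python
def process_once(db: dict[str, str], submit) -> list[str]:
    """One daemon polling cycle over a {group_name: status} table.

    Groups in state "ready" are submitted (e.g. handed to BPS) and
    advanced to "running"; everything else is left for later cycles.
    """
    acted = []
    for name, status in sorted(db.items()):
        if status == "ready":
            submit(name)
            db[name] = "running"
            acted.append(name)
    return acted


submitted: list[str] = []
db = {"step1/group0": "ready", "step1/group1": "done"}
process_once(db, submitted.append)
```

Keeping all state in the database, with the daemon stateless between cycles, is what lets the daemon be restarted without losing track of a campaign.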

## Campaign Configuration

The cm-service database stores both the configuration and state of each
campaign. This configuration system will be described in detail in a future DM
tech note.

## Interfaces to other services

User interaction with cm-service takes place via the API server or the web UI,
while interactions with data facility services are handled by the CM daemon.

All of the processing triggered by cm-service is run via BPS. BPS provides an
abstraction layer for submitting workflows to a variety of workflow management
systems, and a consistent interface for reporting the status of those workflows
back to cm-service. BPS has plugins for submitting jobs via PanDA, HTCondor,
and Parsl, all of which are in active use.
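
The value of that abstraction layer can be sketched as a plugin interface: the calling code is identical regardless of which workflow management system runs the jobs. The class and method names here are illustrative; the actual BPS plugin API (`lsst.ctrl.bps`) differs in detail.

```python
from abc import ABC, abstractmethod


class WmsPlugin(ABC):
    """Hypothetical stand-in for a per-WMS BPS plugin."""

    @abstractmethod
    def submit(self, workflow: str) -> str:
        """Submit a workflow; return a run identifier."""

    @abstractmethod
    def report(self, run_id: str) -> str:
        """Return the status of a previously submitted run."""


class HTCondorLike(WmsPlugin):
    def submit(self, workflow: str) -> str:
        return f"htcondor:{workflow}"

    def report(self, run_id: str) -> str:
        return "RUNNING"


class PandaLike(WmsPlugin):
    def submit(self, workflow: str) -> str:
        return f"panda:{workflow}"

    def report(self, run_id: str) -> str:
        return "RUNNING"


def run_workflow(plugin: WmsPlugin, workflow: str) -> str:
    """cm-service-style caller: one code path for every backend."""
    run_id = plugin.submit(workflow)
    return plugin.report(run_id)
```

Because `run_workflow` depends only on the interface, cm-service does not need per-site logic for whichever WMS a data facility happens to run.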

cm-service interacts directly with the butler to update and modify butler
collections, and possibly to validate that input datasets are ready for
processing.

cm-service also interacts with Rucio to define Rucio datasets and add
replication rules for data transfers after workflows have finished processing.
It may also be necessary for cm-service to insert Rucio rules for the transfer
of input data for a workflow, either as part of a fan-out of (e.g.) global
calibration products, or if the user chooses to reallocate where some
processing is performed.
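
The shape of such a replication request can be sketched as plain data. The field names mirror Rucio concepts (scope, DID, RSE expression, copy count) but this is not the Rucio client API, and the scope, dataset, and RSE names are hypothetical.

```python
def make_replication_rule(
    scope: str, dataset: str, dest_rse: str, copies: int = 1
) -> dict:
    """Build a replication-rule request for a finished workflow's outputs."""
    return {
        "dids": [{"scope": scope, "name": dataset}],  # the dataset to replicate
        "rse_expression": dest_rse,                   # where copies must live
        "copies": copies,                             # how many replicas
    }


# Hypothetical example: ship one group's outputs to an archive site.
rule = make_replication_rule("drp", "step1_group0_outputs", "ARCHIVE_DISK")
```

In practice the daemon would hand such a request to the Rucio client and then track the rule's state until the transfer completes.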

## Example processing of a single workflow

This sequence chart shows the interactions between the daemon and the various
services it uses to process a single workflow and transfer the results.

```{figure} seq_chart.png

Illustration of the interactions necessary for cm-service to process a workflow.
```

Each of these actions has failure modes that cm-service needs to recover from;
for example, the batch jobs may complete while the Rucio service is down, or
butler collection management steps may fail if the butler registry is down.
cm-service is designed to track each of these necessary actions so that it can
recover after services are restored, without needing to redo processing.
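
That recovery idea can be sketched as an action log: record each completed action so that a retry after an outage resumes at the first unfinished action instead of redoing work. The action names and the `ServiceDown` exception are illustrative.

```python
def run_actions(actions, completed: set[str]) -> set[str]:
    """Run (name, fn) pairs in order, skipping already-completed ones."""
    for name, fn in actions:
        if name in completed:
            continue
        fn()                     # may raise while a service is down
        completed.add(name)
    return completed


class ServiceDown(RuntimeError):
    """Stand-in for a dependent service being unavailable."""


calls: list[str] = []
state = {"rucio_up": False}


def collect_outputs():
    calls.append("collect")


def add_rucio_rule():
    if not state["rucio_up"]:
        raise ServiceDown("Rucio unavailable")
    calls.append("rule")


actions = [("collect_outputs", collect_outputs),
           ("add_rucio_rule", add_rucio_rule)]
completed: set[str] = set()

try:
    run_actions(actions, completed)   # first attempt: Rucio is down
except ServiceDown:
    pass                              # the daemon retries on a later cycle

state["rucio_up"] = True              # outage over
run_actions(actions, completed)       # resumes without redoing collect_outputs
```

After the second pass, each action has run exactly once even though the first pass failed partway through, which is the property the text above describes.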

## Open Questions

While we have identified some of the pre- and post-batch-submission tasks that
must be completed to support multi-site processing, it is likely that, as we
begin to test more multi-site workflows, we will need to add or modify the work
done between batch submissions.

Before beginning a processing step, it is critical that all of the input data
are present as expected, including input images, calibration products, reference
catalogs, and the butler collections that make all of these inputs accessible to
the pipelines. Errors in these inputs can result in invalid output products, so
they must be verified before processing begins. This may require additional
interactions with Rucio or the butler at the start of a step.
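
Such a readiness check might amount to a checklist evaluated before any group is submitted. The categories below, and the idea of passing in a set of available inputs, are placeholders for the real butler and Rucio queries.

```python
# Hypothetical input categories a step requires before it may start.
REQUIRED_INPUTS = (
    "input images",
    "calibration products",
    "reference catalogs",
    "input collections",
)


def missing_inputs(available: set[str]) -> list[str]:
    """Return required input categories not yet present; empty means ready."""
    return [kind for kind in REQUIRED_INPUTS if kind not in available]


# Example: calibrations not yet transferred -> the step must not start.
gaps = missing_inputs({"input images", "reference catalogs",
                       "input collections"})
```

A non-empty result would block the step and surface the gap in the monitoring UI rather than letting invalid outputs be produced.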