
Commit 5d1d183

Merge pull request #1 from lsst-dm/tickets/DM-48994
DM-48994: write CM-service technote.

2 parents: d850910 + 1480e05

File tree

5 files changed: +113 -7 lines changed

index.md

Lines changed: 108 additions & 3 deletions

# CM-Service Architecture

```{abstract}
Overview of the role of the campaign management tool cm-service.
```

## CM-Service

The Campaign Management Team has been developing an orchestration tool called
cm-service to support running large processing campaigns. This system is
designed to automate as much of the team's work as possible, enabling more
efficient processing and a faster time to completion for campaigns.

Rubin campaigns are organized as a "workflow of workflows": an overall
data release production campaign is broken down into a series of sequential
phases known as "steps", and within each step a set of batch submissions is
executed on subsets of the data until all data have completed that step. These
individual batch submissions are themselves directed acyclic graph workflows
("quantum graphs"), but the dependencies inside those graphs are managed by the
execution framework rather than by campaign management's tooling.

```{figure} steps_groups.png

Overview of the structure of a campaign in Rubin processing.
```

## Goals for CM-service

1. Manage the submission of workflows to the batch systems, automatically
   submitting downstream steps after precursor steps have finished.

2. Decompose large steps into a set of smaller batch submissions ("groups"),
   which can be submitted at a managed rate over many days or weeks.

3. For multi-site campaigns, coordinate which workflow submissions go to which
   sites, and trigger automatic batch submissions to those sites. Trigger data
   transfers as necessary.

4. Provide visibility into the status of the campaign processing, reporting
   which campaigns are in progress and the status of their steps and groups.

An implementation goal is to avoid cross-facility dependencies: each site will
have downtime, but the other sites must be able to continue processing during
that downtime.
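
Goal 2, pacing group submissions, can be pictured as a selection function that
tops up the number of in-flight groups to a configured cap on each pass. A
minimal sketch; the status strings and the cap are assumptions, not cm-service
parameters:

```python
def select_groups_to_submit(groups: list[dict], max_running: int) -> list[dict]:
    """Pick waiting groups to submit while keeping the total number of
    running groups at or below max_running (a pacing sketch)."""
    running = sum(1 for g in groups if g["status"] == "running")
    to_submit = []
    for g in groups:
        # Stop once submitting another group would exceed the cap.
        if running + len(to_submit) >= max_running:
            break
        if g["status"] == "waiting":
            to_submit.append(g)
    return to_submit
```

Calling this once per daemon pass yields a steady trickle of submissions
rather than one enormous burst at the start of a step.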

## Architecture

The cm-service consists of four components:

- An API service, which users can interact with via a command line interface to
  configure and manage campaigns
- A back-end database to store campaign configuration and state
- A web front-end, initially for monitoring campaign state but which could
  eventually support the same capabilities as the command line interface
- A daemon service which effects the actual processing steps, based on
  information in the campaign database
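
The division of labor between the database and the daemon can be illustrated
with a single pass of a polling loop that reads campaign state and submits the
first unfinished step. This is a hypothetical sketch, not the daemon's actual
logic; the in-memory `db` dict stands in for the back-end database and
`submit` for the real batch submission call:

```python
def daemon_tick(db: dict, submit) -> list[str]:
    """One pass of a hypothetical daemon loop: for each campaign in the
    database, walk its steps in order and submit the first one that is
    still waiting; later steps stay untouched until their precursor
    finishes."""
    submitted = []
    for campaign, steps in db.items():
        for step, status in steps:
            if status == "accepted":
                continue  # this step is done; consider the next one
            if status == "waiting":
                submit(campaign, step)
                submitted.append(f"{campaign}/{step}")
            break  # stop at the first unfinished step
    return submitted
```

Because each pass re-reads state from the database, the daemon itself holds no
irreplaceable state and can be restarted freely.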

## Campaign Configuration

The CM-service database stores both the configuration and state of each
campaign. This configuration system will be described in detail in a future DM
tech note.

## Interfaces to other services

User interaction with cm-service takes place via the API server or the web
UI, while interactions with data facility services are handled by the CM daemon.

All of the processing triggered by cm-service is run via BPS. BPS
provides an abstraction layer for submitting workflows to a variety of workflow
management systems, and also provides a consistent interface for reporting the
status of those workflows back to cm-service. BPS has plugins for submitting
jobs via PanDA, HTCondor, and Parsl, all of which are in active use.

Cm-service interacts directly with the butler to update and modify butler
collections, and possibly to validate that input datasets are ready for
processing.

Cm-service also interacts with Rucio to define Rucio datasets and add
replication rules for data transfers after workflows have finished processing.
It may also be necessary for cm-service to insert Rucio rules for the transfer
of input data for a workflow, either as part of a fan-out of (e.g.) global
calibration products, or if the user chooses to reallocate where some processing
is performed.
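
The post-workflow Rucio sequence can be sketched by composing (without
running) the standard Rucio CLI calls: create a dataset, attach the output
files, and add a replication rule to move a copy to the destination site. The
scope, dataset, file, and RSE names below are made-up examples:

```python
def rucio_transfer_commands(
    scope: str, dataset: str, files: list[str], dest_rse: str
) -> list[list[str]]:
    """Compose, but do not execute, the Rucio calls for publishing a
    finished workflow's outputs and requesting one replica at dest_rse."""
    did = f"{scope}:{dataset}"
    cmds = [["rucio", "add-dataset", did]]
    # Attach each output file to the dataset so the rule covers all of them.
    cmds.append(["rucio", "attach", did] + [f"{scope}:{f}" for f in files])
    # One replica of the dataset at the destination site.
    cmds.append(["rucio", "add-rule", did, "1", dest_rse])
    return cmds
```
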

## Example processing of a single workflow

This sequence chart shows the interactions between the daemon and the various
services it uses to process a single workflow and transfer the results.

```{figure} seq_chart.png

Illustration of the interactions necessary for cm-service to process a workflow.
```

Each of these actions has failure modes that cm-service needs to recover from;
for example, the batch jobs may complete while the Rucio service is down, or
butler collection management steps may fail if the butler registry is down.
Cm-service is designed to track each of these necessary actions so that it can
recover after services are restored, without needing to redo processing.
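
One way to picture this recovery behavior is an action table recording which
post-workflow actions have completed, so that a later pass retries only the
failures. A sketch under assumed names; this is not cm-service's actual
bookkeeping:

```python
def run_pending_actions(actions: list[dict], handlers: dict) -> list[str]:
    """Run each recorded action that has not yet succeeded. On success the
    action is marked done and never redone; on failure (e.g. the service is
    down) it stays pending so a later pass can retry after recovery.
    Returns the names of actions still pending."""
    for action in actions:
        if action["done"]:
            continue  # already completed; never redo it
        try:
            handlers[action["kind"]](action)
            action["done"] = True
        except Exception:
            pass  # service unavailable; leave pending for a later pass
    return [a["name"] for a in actions if not a["done"]]
```

Because completion is recorded per action rather than per workflow, a Rucio
outage does not force the butler steps, much less the batch jobs, to be rerun.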

## Open Questions

While we have identified some of the pre- and post-batch-submission tasks that
must be completed to support multi-site processing, as we begin to test more
multi-site workflows it is likely that we will need to add or modify the work
done between batch submissions.

Before beginning a processing step, it is critical that all of the input data
are present as expected, including input images, calibration products, reference
catalogs, and the butler collections that make all of these inputs accessible to
the pipelines. Errors in these inputs can result in invalid output products, so
they must be verified before processing begins. This may require additional
interactions with Rucio or the butler at the start of a step.
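
Such a pre-flight check might look like the following sketch, which compares
the inputs a step requires against what is actually present and reports the
gaps. The categories and names are hypothetical examples, not a real input
manifest:

```python
def validate_step_inputs(
    required: dict[str, set[str]], available: dict[str, set[str]]
) -> dict[str, set[str]]:
    """Pre-flight check before submitting a step: for each input category
    (e.g. 'calibrations', 'refcats', 'collections'), return whatever the
    step requires that is not present. An empty result means it is safe
    to proceed."""
    missing = {}
    for category, names in required.items():
        gap = names - available.get(category, set())
        if gap:
            missing[category] = gap
    return missing
```
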

requirements.txt

Lines changed: 1 addition & 1 deletion

```diff
@@ -1 +1 @@
-documenteer[technote]>=1.0.0a13
+documenteer[technote]<2
```

seq_chart.png

145 KB

steps_groups.png

192 KB

technote.toml

Lines changed: 4 additions & 3 deletions

```diff
@@ -9,9 +9,6 @@ organization.name = "Vera C. Rubin Observatory"
 organization.ror = "https://ror.org/048g3cy84"
 license.id = "CC-BY-4.0"
 
-[technote.status]
-state = "draft"
-
 [[technote.authors]]
 name.given = "Colin T."
 name.family = "Slater"
@@ -21,3 +18,7 @@ orcid = "https://orcid.org/0000-0002-0558-0521"
 name = "University of Washington"
 internal_id = "Washington"
 address = "Dept. of Astronomy, Box 351580, Seattle, WA 98195, USA"
+
+[[technote.authors]]
+name.given = "Toby"
+name.family = "Jennings"
```
