docs/architecture.md
+42 -38 (42 additions & 38 deletions)
@@ -5,16 +5,38 @@
The sapporo-service is a FastAPI application that accepts WES API requests, prepares a run directory for each workflow execution, and delegates the actual workflow engine invocation to a shell script (`run.sh`). Each workflow engine runs inside its own Docker container, spawned as a sibling container via the host's Docker socket. See [Installation - Volume Mounts](installation.md#volume-mounts-docker-in-docker) for details on the DinD volume mount requirements.
The Python side (`run.py`) never calls a workflow engine directly. It prepares the run directory, writes all input files, then forks `run.sh` as a subprocess. All run data is persisted to the filesystem, with a SQLite index for fast listing.
When the sapporo process restarts (e.g., container recreation), any `run.sh` subprocesses from the previous instance are dead. Runs that were in a non-terminal state are now orphans — their `state.txt` still says `RUNNING` or `QUEUED`, but no process is driving them forward.
+Detects runs stuck in `RUNNING`/`QUEUED` after a process restart and marks them as `SYSTEM_ERROR`.

-At startup, **before** the SQLite index is built, `recover_orphaned_runs()` scans all run directories and transitions orphaned runs to `SYSTEM_ERROR`.
+`reconcile_runs()` runs at startup (before `init_db()`) and periodically in the background (at the snapshot interval, default: 30 minutes). For each run in a non-terminal state, it reads `run.pid` and checks process liveness via `os.kill(pid, 0)`:

-### Target States
+| PID file | Process alive | Action |
+|---|---|---|
+| Present | Yes | Skip (running normally) |
+| Present | No | Set `SYSTEM_ERROR` (reason: "process vanished") |
+| Absent | N/A | Set `SYSTEM_ERROR` (reason: "no pid file") |

-Runs in the following non-terminal states are recovered:
-
-- `INITIALIZING`
-- `QUEUED`
-- `RUNNING`
-- `PAUSED`
-- `PREEMPTED`
-- `CANCELING`
-- `DELETING`
-
-Runs in terminal states (`COMPLETE`, `EXECUTOR_ERROR`, `SYSTEM_ERROR`, `CANCELED`, `DELETED`) and `UNKNOWN` are left unchanged.
-
-### Recovery Actions
-
-For each orphaned run, the recovery process:
-
-1. Sets `state.txt` to `SYSTEM_ERROR`
-2. Writes the current timestamp to `end_time.txt`
-3. Appends a descriptive message to `system_logs.json`
-
-### Ordering
-
-`recover_orphaned_runs()` runs before `init_db()` in the application lifespan, so the SQLite index reflects the corrected states from its first build.
+Runs in terminal states (`COMPLETE`, `EXECUTOR_ERROR`, `SYSTEM_ERROR`, `CANCELED`, `DELETED`) and `UNKNOWN` are skipped. For each reconciled run, `state.txt` is set to `SYSTEM_ERROR`, the current timestamp is written to `end_time.txt`, and the reason is logged to `system_logs.json`.
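
For reviewers, the reconciliation rules in the added lines can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `reconcile_run()` and `pid_alive()` are hypothetical helper names, the ISO 8601 timestamp format and the shape of `system_logs.json` (a JSON list of strings) are assumptions, and a missing `state.txt` is treated like `UNKNOWN`.

```python
import json
import os
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

# States that reconciliation skips, as listed in the doc text.
SKIPPED_STATES = {
    "COMPLETE", "EXECUTOR_ERROR", "SYSTEM_ERROR", "CANCELED", "DELETED", "UNKNOWN",
}


def pid_alive(pid: int) -> bool:
    """Signal 0 only performs existence and permission checks; nothing is delivered."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but is owned by another user
    return True


def reconcile_run(run_dir: Path) -> Optional[str]:
    """Mark one run SYSTEM_ERROR if orphaned; return the reason, or None if skipped."""
    state_file = run_dir / "state.txt"
    # Assumption: a missing state file is handled like UNKNOWN and skipped.
    state = state_file.read_text().strip() if state_file.exists() else "UNKNOWN"
    if state in SKIPPED_STATES:
        return None

    pid_file = run_dir / "run.pid"
    if not pid_file.exists():
        reason = "no pid file"
    elif pid_alive(int(pid_file.read_text().strip())):  # assumes a decimal PID
        return None  # running normally
    else:
        reason = "process vanished"

    state_file.write_text("SYSTEM_ERROR")
    # Assumption: ISO 8601 in UTC; the real timestamp format is not shown here.
    (run_dir / "end_time.txt").write_text(datetime.now(timezone.utc).isoformat())
    # Assumption: system_logs.json holds a JSON list of message strings.
    logs_file = run_dir / "system_logs.json"
    logs = json.loads(logs_file.read_text()) if logs_file.exists() else []
    logs.append(reason)
    logs_file.write_text(json.dumps(logs))
    return reason
```

A caller would apply `reconcile_run()` to every run directory once at startup and again on each periodic pass. Note that a liveness check by PID alone cannot tell whether the PID has been reused by an unrelated process.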

## SQLite Index

-The SQLite database (`sapporo.db`) is an **index**, not a data store. It is rebuilt at a configurable interval (default: 30 minutes) by scanning the run directories and can be deleted at any time without data loss. It exists solely to make `GET /runs` (list all runs) fast. Individual run queries (`GET /runs/{run_id}`) always read from the filesystem.
+The SQLite database (`sapporo.db`) is an **index**, not a data store. It is rebuilt at a configurable interval (default: 30 minutes) by a background asyncio task that scans the run directories, and can be deleted at any time without data loss. It exists solely to make `GET /runs` (list all runs) fast. Individual run queries (`GET /runs/{run_id}`) always read from the filesystem.
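
To illustrate why the index is disposable, here is a rough sketch of a rebuild-from-filesystem loop. It is not the service's actual implementation: `rebuild_index()` and `snapshot_loop()` are hypothetical names, the two-column `runs` table is an invented schema, and the flat `<run_root>/<run_id>/state.txt` layout is an assumption.

```python
import asyncio
import sqlite3
from pathlib import Path


def rebuild_index(run_root: Path, db_path: Path) -> int:
    """Recreate the index from the run directories; return the number of rows written."""
    # Assumption: each run lives in <run_root>/<run_id>/ with a state.txt inside.
    rows = [
        (state_file.parent.name, state_file.read_text().strip())
        for state_file in sorted(run_root.glob("*/state.txt"))
    ]
    con = sqlite3.connect(db_path)  # creates the file if it was deleted
    try:
        con.execute("DROP TABLE IF EXISTS runs")
        con.execute("CREATE TABLE runs (run_id TEXT PRIMARY KEY, state TEXT NOT NULL)")
        with con:  # commits the inserts on success, rolls back on error
            con.executemany("INSERT INTO runs VALUES (?, ?)", rows)
    finally:
        con.close()
    return len(rows)


async def snapshot_loop(run_root: Path, db_path: Path, interval_s: float = 30 * 60) -> None:
    """Rebuild the index at a fixed interval until the task is cancelled."""
    while True:
        # Run the blocking scan in a worker thread so the event loop stays responsive.
        await asyncio.to_thread(rebuild_index, run_root, db_path)
        await asyncio.sleep(interval_s)
```

Because every rebuild starts from the filesystem, deleting `sapporo.db` between passes loses nothing: the next call to `rebuild_index()` recreates it.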