Commit 4829cda

Merge pull request #227 from fabriziosestito/rfc/scanjob
docs(rfc): add scan trigger RFC
2 parents f1dea67 + ecfd37d commit 4829cda

1 file changed

Lines changed: 161 additions & 0 deletions

docs/rfc/0002_scan_trigger.md

@@ -0,0 +1,161 @@
1+
| | |
2+
| :----------- | :------------------------------------------------------------- |
3+
| Feature Name | Scan trigger |
4+
| Start Date | 2025-05-30 |
5+
| Category | Architecture |
6+
| RFC PR | [#227](https://github.com/rancher-sandbox/sbombastic/pull/227) |
7+
| State | **ACCEPTED** |
8+
9+
# Summary
10+
11+
[summary]: #summary
12+
13+
This RFC introduces the `ScanJob` CRD and describes how a user or an automated system can trigger a scan on a `Registry`.
14+
This supersedes part of [RFC 0001 - Scanner architecture and design](./0001_scanner_architecture_and_design.md), removing the need for a separate `DiscoveryJob` CRD.
15+
16+
# Motivation
17+
18+
[motivation]: #motivation
19+
20+
We need a way for users and other actors to trigger scans on container registries through a declarative, Kubernetes-native API.
21+
The `ScanJob` CRD follows the same pattern as the native Kubernetes `batch/v1` `Job` resource: its status field will track scan progress and results, providing familiar Kubernetes-style observability for scan operations.
22+
This enables better integration with the Rancher UI, workflows, automation, and GitOps processes while leveraging standard Kubernetes tooling for management and monitoring.
23+
24+
## Examples / User Stories
25+
26+
[examples]: #examples
27+
28+
- As a user I want to manually trigger the execution of a scan configuration on demand.
29+
- As a user I want the system to automatically trigger scans on a registry periodically.
30+
- As a user I want the system to automatically trigger a scan when a new registry is created or an existing one is updated with new repositories.
31+
32+
# Detailed design
33+
34+
[design]: #detailed-design
35+
36+
This RFC replaces the idea of having separate "discovery" and "scan" jobs.
37+
From now on, "scan" means the full process: finding images in a registry, creating SBOMs, and checking for vulnerabilities.
38+
39+
## ScanJob CRD
40+
41+
To trigger a scan, we define the `ScanJob` custom resource, which serves as a trigger for scanning a specific `Registry` resource.
42+
43+
An example `ScanJob` manifest looks like this:
44+
45+
```yaml
apiVersion: sbombastic.rancher.io/v1alpha1
kind: ScanJob
metadata:
  name: scanjob-example
  namespace: default
spec:
  registry: example-registry # Name of the Registry resource (in the same namespace) to be scanned
```
54+
55+
### ScanJob status
56+
57+
The `ScanJob` resource will include a status field to reflect the scan's progress and outcome. This status will contain:
58+
59+
- `conditions`: Represents detailed job conditions, similar to those used in Kubernetes Jobs, showing whether the scan completed successfully or encountered issues (`Complete`, `Failed`).
60+
- `imagesCount`: The number of images found in the registry during the scan.
61+
62+
Please refer to the [Kubernetes API conventions](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties) for more information about the status conditions.
63+
64+
Example status:
65+
66+
```yaml
status:
  imagesCount: 100
  conditions:
    - type: Complete
      status: "False"
      reason: "Processing"
      message: "Job in progress"
    - type: Failed
      status: "False"
      reason: "Processing"
      message: "Job in progress"
```
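
As a rough illustration of how a controller might read this status, the following Go sketch models the two fields and derives an "in progress" state from the conditions. The `Condition` type here is a trimmed, hypothetical stand-in for `metav1.Condition`, and the helper names are assumptions, not part of this RFC.

```go
package main

import "fmt"

// Condition is a simplified, hypothetical stand-in for metav1.Condition.
type Condition struct {
	Type    string
	Status  string // "True", "False" or "Unknown"
	Reason  string
	Message string
}

// ScanJobStatus mirrors the two status fields described in this RFC.
type ScanJobStatus struct {
	ImagesCount int
	Conditions  []Condition
}

// conditionTrue reports whether the condition of the given type exists and is "True".
func conditionTrue(s ScanJobStatus, condType string) bool {
	for _, c := range s.Conditions {
		if c.Type == condType {
			return c.Status == "True"
		}
	}
	return false
}

// InProgress reports whether the job is neither Complete nor Failed.
func InProgress(s ScanJobStatus) bool {
	return !conditionTrue(s, "Complete") && !conditionTrue(s, "Failed")
}

func main() {
	status := ScanJobStatus{
		ImagesCount: 100,
		Conditions: []Condition{
			{Type: "Complete", Status: "False", Reason: "Processing", Message: "Job in progress"},
			{Type: "Failed", Status: "False", Reason: "Processing", Message: "Job in progress"},
		},
	}
	fmt.Println("in progress:", InProgress(status))
}
```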
79+
80+
### Validation
81+
82+
Only one `ScanJob` can run against a `Registry` at a time. If a `ScanJob` is already in progress for a `Registry`, any attempt to create another one for that `Registry` will be rejected.
83+
A `ValidatingWebhook` enforces this by checking existing `ScanJob` resources in the same namespace to ensure no conflicts occur.
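
The check the webhook performs could look roughly like the sketch below. It is a minimal model using plain Go structs rather than the real API types; the `ScanJob` struct, its `Finished` flag and `validateCreate` are hypothetical names introduced only for illustration.

```go
package main

import "fmt"

// ScanJob is a hypothetical, trimmed view of the resource for this sketch.
type ScanJob struct {
	Name      string
	Namespace string
	Registry  string // name of the referenced Registry
	Finished  bool   // true once the job is Complete or Failed
}

// validateCreate rejects the candidate if an unfinished ScanJob in the same
// namespace already targets the same Registry.
func validateCreate(candidate ScanJob, existing []ScanJob) error {
	for _, job := range existing {
		sameTarget := job.Namespace == candidate.Namespace && job.Registry == candidate.Registry
		if sameTarget && !job.Finished {
			return fmt.Errorf("scan job %q is already running against registry %q", job.Name, job.Registry)
		}
	}
	return nil
}

func main() {
	existing := []ScanJob{{Name: "scanjob-example", Namespace: "default", Registry: "example-registry"}}
	candidate := ScanJob{Name: "scanjob-second", Namespace: "default", Registry: "example-registry"}
	fmt.Println("rejected:", validateCreate(candidate, existing) != nil)
}
```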
84+
85+
### Reconciler
86+
87+
A `ScanJob` reconciler will be introduced to handle and manage the entire lifecycle of `ScanJob` resources.
88+
89+
## Scan trigger flow
90+
91+
When a `ScanJob` is created, the following sequence of actions is triggered:
92+
93+
1. **The `ScanJob` reconciler** fetches the referenced `Registry` resource.
94+
2. **If the Registry is not found**, the reconciler marks the `ScanJob` as `Failed` with an appropriate message.
95+
3. **The `ScanJob` reconciler** adds the serialized `Registry` resource as an annotation on the `ScanJob`. This ensures the scan uses a consistent snapshot of the registry configuration.
96+
4. **The `ScanJob` reconciler** sends a message on the NATS queue to trigger the scan workflow.
97+
5. **The `ScanJob` reconciler** updates the `ScanJob` status to `InProgress`.
98+
6. **A worker** receives the message and starts the discovery process.
99+
7. **The worker** discovers images in the registry.
100+
8. **The worker** updates the `ScanJob` status field `imagesCount` with the number of images found.
101+
9. **The worker** sends a NATS message for each image to trigger SBOM generation.
102+
10. **Workers** generate SBOMs and send messages to initiate vulnerability scans.
103+
11. **Workers** create a `VulnerabilityReport` resource for each image with the scan results.
104+
12. **The `VulnerabilityReport` reconciler** monitors the number of `VulnerabilityReport` resources and, once it matches `imagesCount`, marks the `ScanJob` as `Complete`.
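
Step 3 can be sketched as follows: the reconciler serializes the `Registry` into an annotation, and a worker later decodes that snapshot instead of re-reading the live resource. The annotation key, the `Registry` fields and both function names are assumptions made for this example only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// registryAnnotation is a hypothetical annotation key, not defined by this RFC.
const registryAnnotation = "sbombastic.rancher.io/registry-snapshot"

// Registry is a trimmed, hypothetical view of the Registry resource.
type Registry struct {
	Name         string   `json:"name"`
	URI          string   `json:"uri"`
	Repositories []string `json:"repositories"`
}

// ScanJob carries only the fields this sketch needs.
type ScanJob struct {
	Name        string
	Annotations map[string]string
}

// snapshotRegistry stores the serialized Registry on the ScanJob.
func snapshotRegistry(job *ScanJob, reg Registry) error {
	data, err := json.Marshal(reg)
	if err != nil {
		return fmt.Errorf("serializing registry: %w", err)
	}
	if job.Annotations == nil {
		job.Annotations = map[string]string{}
	}
	job.Annotations[registryAnnotation] = string(data)
	return nil
}

// registryFromSnapshot decodes the Registry a worker should scan.
func registryFromSnapshot(job ScanJob) (Registry, error) {
	var reg Registry
	raw, ok := job.Annotations[registryAnnotation]
	if !ok {
		return reg, fmt.Errorf("scan job %q has no registry snapshot", job.Name)
	}
	if err := json.Unmarshal([]byte(raw), &reg); err != nil {
		return reg, fmt.Errorf("decoding registry snapshot: %w", err)
	}
	return reg, nil
}

func main() {
	job := ScanJob{Name: "scanjob-example"}
	reg := Registry{Name: "example-registry", URI: "registry.example.com", Repositories: []string{"library/app"}}
	if err := snapshotRegistry(&job, reg); err != nil {
		panic(err)
	}
	fmt.Println(job.Annotations[registryAnnotation])
}
```

Because the worker reads the snapshot, later edits to the `Registry` do not affect a scan that has already started.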
105+
106+
This design simplifies the architecture by retaining only the `ScanJob` and `VulnerabilityReport` reconcilers.
107+
Unlike the previous model, where the `Image` and `SBOM` reconcilers coordinated different stages of the scan, the worker now directly publishes follow-up jobs (e.g., SBOM generation, vulnerability scan) to the queue.
108+
This reduces the number of Kubernetes API interactions and streamlines the scanning workflow.
109+
110+
## Error handling
111+
112+
- Transient errors encountered during the scan process (such as network problems or registry downtime) will be automatically retried by both reconcilers and workers. Workers will use exponential backoff for these retries. If the scan continues to fail after multiple attempts, the `ScanJob` will be marked as `Failed` with a relevant error message.
113+
- For non-transient errors (like an invalid registry configuration), the `ScanJob` will be marked as `Failed` immediately, accompanied by a clear error message.
114+
115+
## Registry deletion
116+
117+
A finalizer will be added to the `Registry` resource to guarantee that deletion only proceeds once any ongoing `ScanJob` has either completed or failed.
118+
This ensures that scans are not interrupted mid-process, preserving the integrity of scan results and preventing orphaned resources.
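
The finalizer handling could be sketched as below: the finalizer is removed from a `Registry` only once no unfinished `ScanJob` references it. The finalizer name, the struct fields and the function names are hypothetical and serve only to illustrate the rule.

```go
package main

import "fmt"

// scanFinalizer is a hypothetical finalizer name, not defined by this RFC.
const scanFinalizer = "sbombastic.rancher.io/scanjob-protection"

// ScanJob and Registry are trimmed, hypothetical views of the resources.
type ScanJob struct {
	Namespace string
	Registry  string
	Finished  bool // true once the job is Complete or Failed
}

type Registry struct {
	Name       string
	Namespace  string
	Finalizers []string
}

// hasRunningScan reports whether any unfinished ScanJob targets the registry.
func hasRunningScan(reg Registry, jobs []ScanJob) bool {
	for _, job := range jobs {
		if job.Namespace == reg.Namespace && job.Registry == reg.Name && !job.Finished {
			return true
		}
	}
	return false
}

// releaseIfIdle removes the finalizer when no scan is running and reports
// whether the finalizer was removed.
func releaseIfIdle(reg *Registry, jobs []ScanJob) bool {
	if hasRunningScan(*reg, jobs) {
		return false
	}
	kept := reg.Finalizers[:0]
	removed := false
	for _, f := range reg.Finalizers {
		if f == scanFinalizer {
			removed = true
			continue
		}
		kept = append(kept, f)
	}
	reg.Finalizers = kept
	return removed
}

func main() {
	reg := Registry{Name: "example-registry", Namespace: "default", Finalizers: []string{scanFinalizer}}
	jobs := []ScanJob{{Namespace: "default", Registry: "example-registry"}}
	fmt.Println("released while scanning:", releaseIfIdle(&reg, jobs))
	jobs[0].Finished = true
	fmt.Println("released after scan:", releaseIfIdle(&reg, jobs))
}
```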
119+
120+
## Garbage collection
121+
122+
A maximum of X `ScanJob` resources per registry will be retained in the system for auditing and historical purposes, with X being a configurable value.
123+
This logic could be implemented in either the `ValidatingWebhook` or the `ScanJob` reconciler.
124+
125+
## Scheduled scans
126+
127+
The scan frequency is set in the `Registry` resource via the `spec.scanInterval` field.
128+
A new [`Runnable`](https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/manager#Runnable) will be implemented to regularly trigger scans for all registries.
129+
Using a ticker, the runnable will periodically examine each `Registry`’s `spec.scanInterval` and create a `ScanJob` if the time since the last scan exceeds the configured interval.
130+
This allows us to use the same resource and reconciliation logic for both manual and scheduled scans, simplifying the architecture.
131+
132+
# Drawbacks
133+
134+
[drawbacks]: #drawbacks
135+
136+
<!---
137+
Why should we **not** do this?
138+
139+
* obscure corner cases
140+
* will it impact performance?
141+
* what other parts of the product will be affected?
142+
* will the solution be hard to maintain in the future?
143+
--->
144+
145+
# Alternatives
146+
147+
[alternatives]: #alternatives
148+
149+
<!---
150+
- What other designs/options have been considered?
151+
- What is the impact of not doing this?
152+
--->
153+
154+
# Unresolved questions
155+
156+
[unresolved]: #unresolved-questions
157+
158+
<!---
159+
- What are the unknowns?
160+
- What can happen if Murphy's law holds true?
161+
--->
