Commit 9011880: doc: add design proposal for K8s SA based volume access restriction
Signed-off-by: Rakshith R <[email protected]>
# Kubernetes ServiceAccount Based Volume Access Restriction

## Introduction

This proposal introduces an optional mechanism to restrict volume access based
on the Kubernetes ServiceAccount of the Pod mounting the volume. When
configured, only Pods running with one of the specified ServiceAccounts are
allowed to mount the volume. All other mount attempts are rejected with a
`PermissionDenied` error.

The restriction is stored as metadata on the backend Ceph object (RBD image
metadata or CephFS subvolume metadata) and is enforced at mount time through
the CSI [`podInfoOnMount`][pod-info-on-mount] mechanism.

This design depends solely on Kubernetes providing the Pod's ServiceAccount
name in the volume context during `NodePublishVolume` calls via the standard
key `csi.storage.k8s.io/serviceAccount.name: {pod.Spec.ServiceAccountName}`.
No other validation is performed on the ServiceAccount name.

[pod-info-on-mount]:
<https://kubernetes-csi.github.io/docs/pod-info.html#pod-info-on-mount-with-csi-driver-object>
## Motivation

Ceph-CSI volumes are accessible to any Pod that has a valid PVC reference and
the necessary RBAC to use the StorageClass. In multi-tenant and data pipeline
environments, this is insufficient. There are scenarios where a volume should
be exclusively accessible to a specific workload identity even when other Pods
in the same namespace can reference the PVC.

### Use Case: Ceph VolSync Plugin Replication Destination PVC Protection

A primary motivator for this feature is the custom
[Ceph VolSync Plugin](https://github.com/RamenDR/ceph-volsync-plugin) that
performs incremental data replication across clusters. In a disaster recovery
or migration workflow:

1. A `ReplicationDestination` controller creates a PVC on the destination
   cluster to receive replicated data.
1. A replication worker Pod, running under a dedicated ServiceAccount (e.g.
   `volsync-worker-sa`), incrementally syncs data from the source cluster into
   this destination PVC.
1. The destination PVC must remain writable only by the replication worker
   until the replication is complete and a failover is triggered.

Without a ServiceAccount-based restriction, any Pod in the namespace with a
reference to the destination PVC could write to it, potentially corrupting the
replicated data or breaking the incremental sync state. By binding the
destination volume to the replication worker's ServiceAccount, the volume is
protected from unintended writes throughout the replication lifecycle. On
failover, the restriction is removed so the application workload can mount
the volume.

### Other Potential Use Cases

- **Sensitive data volumes**: Restrict access to volumes containing regulated
  data to only the ServiceAccount authorized to process them.
- **Custom use cases**: Similar scenarios where a workload identity needs
  exclusive access to a volume for data integrity or security reasons.
## Dependency

- The [`podInfoOnMount`][pod-info-on-mount] field must be set to `true` in the
  CSIDriver specification. This causes the kubelet to inject Pod information
  (including the ServiceAccount name) into the volume context during
  `NodePublishVolume`. Without this, the restriction cannot be enforced. Since
  this parameter is a mutable field in the CSIDriver spec, it will be enabled
  by default going forward (Ceph-CSI v3.17.0+).
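For reference, `podInfoOnMount` is a field on the Kubernetes CSIDriver object;
a minimal manifest for the RBD driver might look like the following sketch
(other CSIDriver fields elided):

```yaml
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: rbd.csi.ceph.com
spec:
  # Required so the kubelet passes the Pod's ServiceAccount name
  # (csi.storage.k8s.io/serviceAccount.name) in the volume context
  # of NodePublishVolume.
  podInfoOnMount: true
```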
## Design

### Metadata Keys

Each driver type uses a driver-specific metadata key to store the allowed
ServiceAccount name(s) as a comma-separated list (e.g. `sa1,sa2,sa3`):

| Driver | Metadata Key | Storage |
|--------|--------------|---------|
| RBD | `.rbd.csi.ceph.com/serviceaccount` | RBD image metadata |
| CephFS | `.cephfs.csi.ceph.com/serviceaccount` | CephFS subvolume metadata |
| NVMe-oF | `.rbd.csi.ceph.com/serviceaccount` | RBD image metadata (via RBD backend) |
| NFS | `.cephfs.csi.ceph.com/serviceaccount` | CephFS subvolume metadata (via CephFS backend) |

One or more ServiceAccounts can be specified per volume, separated by commas.
### CSI Flow

The restriction is enforced across two CSI RPCs:

1. **ControllerPublishVolume**: The controller reads the ServiceAccount
   metadata from the Ceph backend. If present, it is included in the publish
   context passed to the node.
1. **NodePublishVolume**: The node plugin splits the publish context value on
   commas and checks whether the Pod's ServiceAccount (provided by the kubelet
   via `csi.storage.k8s.io/serviceAccount.name` in the volume context) matches
   any entry. A mismatch results in a `PermissionDenied` error. If no
   restriction is set, or if `podInfoOnMount` is not enabled, the mount is
   allowed.
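The `NodePublishVolume` check amounts to a comma-split and membership test. A
minimal Go sketch of that logic follows; the function name and map-based
contexts are illustrative, not the actual Ceph-CSI API, and a plain error
stands in for the gRPC `PermissionDenied` status the real code would return:

```go
package main

import (
	"fmt"
	"strings"
)

// podSAKey is the standard CSI key the kubelet injects into the volume
// context when podInfoOnMount is enabled on the CSIDriver object.
const podSAKey = "csi.storage.k8s.io/serviceAccount.name"

// validateServiceAccountRestriction checks the Pod's ServiceAccount against
// the comma-separated allow-list carried in the publish context under the
// driver-specific metadata key saMetaKey (e.g. the RBD key from the table).
func validateServiceAccountRestriction(publishCtx, volumeCtx map[string]string, saMetaKey string) error {
	allowed := publishCtx[saMetaKey]
	if allowed == "" {
		return nil // no restriction set on this volume
	}
	podSA, ok := volumeCtx[podSAKey]
	if !ok {
		return nil // podInfoOnMount disabled: restriction cannot be enforced
	}
	for _, sa := range strings.Split(allowed, ",") {
		if strings.TrimSpace(sa) == podSA {
			return nil // ServiceAccount matches an allowed entry
		}
	}
	// The real implementation would return
	// status.Error(codes.PermissionDenied, ...) here.
	return fmt.Errorf("permission denied: serviceaccount %q may not mount this volume", podSA)
}
```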
### Implementation

A shared validation function `ValidateServiceAccountRestriction` in
`internal/util/validate.go` is called at the beginning of `NodePublishVolume`
in all four drivers (RBD, CephFS, NFS, NVMe-oF), ensuring consistent
enforcement.

Each driver reads the restriction metadata in `ControllerPublishVolume` using
its backend:

- **RBD**: reads via `GetMetadata` in `internal/rbd/controllerserver.go`.
- **CephFS**: reads via `ListMetadata` in
  `internal/cephfs/controllerserver.go`.
- **NVMe-oF**: delegates to the RBD backend and propagates the publish context
  in `internal/nvmeof/controller/controllerserver.go`.
- **NFS**: delegates to the CephFS backend in
  `internal/nfs/controller/controllerserver.go`.
## Setting and Removing the Restriction

The restriction is managed through Ceph CLI commands. Refer to the
"Kubernetes ServiceAccount Based Volume Access" sections in
[RBD deploy.md](../../rbd/deploy.md) and
[CephFS deploy.md](../../cephfs/deploy.md) for usage instructions and
examples.
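As a sketch of what those documents describe, the restriction can be set,
inspected, and removed with standard Ceph metadata commands; pool, image,
filesystem, subvolume, and group names below are placeholders, and the
metadata keys are the ones from the table above:

```shell
# RBD: allow only sa1 and sa2 to mount the volume backed by this image
rbd image-meta set <pool>/<image> .rbd.csi.ceph.com/serviceaccount "sa1,sa2"
rbd image-meta get <pool>/<image> .rbd.csi.ceph.com/serviceaccount
rbd image-meta remove <pool>/<image> .rbd.csi.ceph.com/serviceaccount

# CephFS: the same, on the subvolume backing the volume
ceph fs subvolume metadata set <filesystem> <subvolume> \
  .cephfs.csi.ceph.com/serviceaccount "sa1,sa2" --group_name <group>
ceph fs subvolume metadata rm <filesystem> <subvolume> \
  .cephfs.csi.ceph.com/serviceaccount --group_name <group>
```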
## Ceph VolSync Plugin Integration Example

1. The replication destination worker sets the ServiceAccount restriction on
   the backing Ceph object (RBD image or CephFS subvolume) to the replication
   worker's ServiceAccount (e.g. `volsync-worker-sa`) on first use.
1. Only the worker Pod mounts the destination PVC successfully because its
   ServiceAccount matches. Any other Pod attempting to mount the same PVC is
   rejected with `PermissionDenied` during the `NodePublishVolume` call,
   protecting data integrity during incremental sync.
1. On replication destination deletion, the controller spins up a cleanup job
   that removes the ServiceAccount restriction metadata, allowing the
   application workload to mount the volume.
## Limitations

- Enforced at CSI mount time only; does not prevent direct access to the
  underlying Ceph storage from outside Kubernetes.
- If `podInfoOnMount` is not enabled, the restriction is silently unenforced.
- Changing the restriction on an already-mounted volume does not affect
  existing mounts. The volume must be unmounted and remounted.
- Managed through Ceph CLI commands, not Kubernetes-native APIs.
## Future Enhancements

- Support restriction based on other attributes (e.g. Pod name or namespace)
  in addition to ServiceAccount.
- Provide more flexible key-value configuration options (e.g. matching
  arbitrary expected key-value pairs in the volume context instead of a
  single ServiceAccount name).
- Support restriction for static volumes.
