Refactor SXM for SMAPIv1 code to fit the new code structure (xapi-project#6423)
This is a continuation of xapi-project#6404, which completes the refactoring of SXM
code for the new architecture.
I expect there to be no functional change, although there is a significant change in how error handling is done.
The last couple of commits contain the design for this PR.
To be continued...

The core idea of storage migration is surprisingly simple: we have VDIs attached to a VM, and we wish to migrate these VDIs from one SR to another. This necessarily requires us to copy the data stored in these VDIs over to the new SR, which can be a long-running process if there are gigabytes or even terabytes of it. We wish to minimise the downtime of this process to allow the VM to keep running as much as possible.

At a very high level, the SXM process generally consists of just two stages: preparation and mirroring. The preparation is about getting the receiving host ready for the mirroring operation, while the mirroring itself can be further divided into two operations: 1. sending new writes to both sides; 2. copying existing data from source to destination. The exact details of how to set up a mirror differ significantly between SMAPIv1 and SMAPIv3, but both of them have to perform these two operations. Once the mirroring is established, it is a matter of checking the status of the mirroring and carrying on with the following VM migration.

The reality is more complex than we had hoped. For example, in SMAPIv1, mirror establishment is quite an involved process and is itself divided into several stages, which will be discussed in more detail later on.

## SXM Multiplexing

This section is about the design idea behind the additional layer of multiplexing, specifically for Storage Xen Motion (SXM) from SRs using SMAPIv3. It is recommended that you read the [introduction doc](_index.md) for the storage layer first, to understand how storage multiplexing is done between SMAPIv2 and SMAPI{v1, v3}, before reading this.

### Motivation

The existing SXM code was designed to work only with SMAPIv1 SRs, and therefore does not take into account the dramatic difference in the way SXM is done between SMAPIv1 and SMAPIv3. The exact differences will be covered later on in this doc; for this section it is sufficient to know that the two require different ways of doing migration. Therefore, we need different code paths for migration from SMAPIv1 and SMAPIv3.

#### But we have storage_mux.ml

Indeed, storage_mux.ml is responsible for multiplexing and forwarding requests to the correct storage backend, based on the SR type that the caller specifies. In fact, for inbound SXM to SMAPIv3 (i.e. migrating into a SMAPIv3 SR, GFS2 for example), storage_mux already does the heavy lifting of multiplexing between different storage backends. Every time a `Remote.` call is invoked, it goes through the SMAPIv2 layer to the remote host and gets multiplexed on the destination host, based on whether we are migrating into a SMAPIv1 or SMAPIv3 SR (see the diagram below). Inbound SXM is therefore supported by implementing the existing SMAPIv2 -> SMAPIv3 calls (see `import_activate` for example) that may not have been implemented before.

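To illustrate the destination-side multiplexing just described, here is a much-simplified sketch, not the real storage_mux.ml API: a table maps each SR to the backend that registered it, and every incoming SMAPIv2 call is forwarded accordingly. The names `BACKEND`, `register` and `vdi_attach` are illustrative.

```ocaml
(* Simplified model of destination-side multiplexing: look up which backend
   (SMAPIv1 or SMAPIv3) registered the SR named in the call and forward to it. *)
module type BACKEND = sig
  val vdi_attach : sr:string -> vdi:string -> string
end

(* Table from SR to the backend that registered it. *)
let backends : (string, (module BACKEND)) Hashtbl.t = Hashtbl.create 7

let register ~sr backend = Hashtbl.replace backends sr backend

(* Every incoming SMAPIv2 call is multiplexed on the SR it names. *)
let vdi_attach ~sr ~vdi =
  let module B = (val Hashtbl.find backends sr : BACKEND) in
  B.vdi_attach ~sr ~vdi
```
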
While this works fine for inbound SXM, it does not work for outbound SXM. A typical SXM involves a source SR type (v1/v3) and a destination SR type (v1/v3), and any of the four combinations is possible. We have already covered the destination multiplexing (v1/v3) by utilising storage_mux, and at this point we have run out of multiplexers for the source. In other words, we can only multiplex once for each SMAPIv2 call, we can use that chance for either the source or the destination, and we have already used it for the latter.

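Written out as a type (illustrative names only, not part of the xapi code base), the four routes look like this:

```ocaml
(* The source and the destination SR can each be SMAPIv1 or SMAPIv3, giving
   four possible routes. storage_mux already dispatches on [dest] via the
   Remote SMAPIv2 calls; nothing dispatches on [src] yet. *)
type backend = SMAPIv1 | SMAPIv3

type sxm_route = {src: backend; dest: backend}
```
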
#### Thought experiments on an alternative design

To make it even more concrete, let us consider an example: the mirroring logic in SXM is different based on the source SR type of the SXM call. You might imagine defining a function like `MIRROR.start v3_sr v1_sr` that will be multiplexed by storage_mux based on the source SR type, and forwarded to storage_smapiv3_migrate, or even just xapi-storage-script, which is indeed quite possible. At this point we have already done the multiplexing, but we still wish to multiplex operations on destination SRs; for example, we might want to attach a VDI belonging to a SMAPIv1 SR on the remote host. But as we have already done the multiplexing and are now inside xapi-storage-script, we have lost any chance of doing any further multiplexing :(

### Design

The idea of this new design is to introduce an additional multiplexing layer that is specific to multiplexing calls based on the source SR type. For example, in the diagram below `send_start src_sr dest_sr` takes both the source SR and the destination SR as parameters, and since the mirroring logic is different for different types of source SRs (i.e. SMAPIv1 or SMAPIv3), the storage migration code must choose the right code path based on the source SR type. This is exactly what is done in this additional multiplexing layer. The respective logic for doing {v1,v3}-specific mirroring, for example, will stay in storage_smapi{v1,v3}_migrate.ml.

Note that later on storage_smapi{v1,v3}_migrate.ml will still have the flexibility to call remote SMAPIv2 functions, such as `Remote.VDI.attach dest_sr vdi`, and it will be handled just as before.

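As a rough illustration of this dispatch, here is a minimal sketch. It assumes that each of storage_smapi{v1,v3}_migrate exposes a `send_start` of this shape and that the source SR's backend type can be looked up; the `backend_of_sr` stub and the function bodies are hypothetical, not the actual xapi API.

```ocaml
type backend = SMAPIv1 | SMAPIv3

(* In xapi this would be looked up from the SR's SM record; stubbed here. *)
let backend_of_sr sr = if sr = "gfs2-sr" then SMAPIv3 else SMAPIv1

module Storage_smapiv1_migrate = struct
  let send_start ~src_sr ~dest_sr =
    Printf.printf "v1 mirroring: snapshot, mirror and copy %s -> %s\n" src_sr dest_sr
end

module Storage_smapiv3_migrate = struct
  let send_start ~src_sr ~dest_sr =
    Printf.printf "v3 mirroring: %s -> %s\n" src_sr dest_sr
end

(* The additional multiplexing layer: dispatch on the *source* SR type only;
   destination-side multiplexing still happens via the usual Remote.* calls. *)
let send_start ~src_sr ~dest_sr =
  match backend_of_sr src_sr with
  | SMAPIv1 -> Storage_smapiv1_migrate.send_start ~src_sr ~dest_sr
  | SMAPIv3 -> Storage_smapiv3_migrate.send_start ~src_sr ~dest_sr
```
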
## SMAPIv1 migration

At a high level, mirror establishment for SMAPIv1 works as follows:

1. Take a snapshot of a VDI that is attached to VM1. This gives us an immutable copy of the current state of the VDI, with all the data up to the point at which we took the snapshot. This is illustrated in the diagram as a VDI and its snapshot connecting to a shared parent, which stores the shared content for the snapshot and the writable VDI from which we took the snapshot (snapshot)
2. Mirror the writable VDI to the remote host: this means that all writes that go to the client VDI will also be written to the mirrored VDI on the remote host (mirror)
3. Copy the immutable snapshot from our local host to the remote (copy)
4. Compose the mirror and the snapshot to form a single VDI
5. Destroy the snapshot on the local host (cleanup)

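The sequence above, as a toy OCaml sketch: the functions here (`snapshot`, `start_mirror`, `copy`, `compose`, `destroy`) are stand-ins for the real SMAPIv1 operations, not their actual signatures.

```ocaml
type vdi = string

let snapshot (vdi : vdi) : vdi = vdi ^ "-snapshot"

let start_mirror (vdi : vdi) ~dest : vdi = Printf.sprintf "%s-mirror@%s" vdi dest

let copy (snap : vdi) ~dest : vdi = Printf.sprintf "%s-copy@%s" snap dest

let compose ~mirror ~base = Printf.printf "composing %s onto %s\n" base mirror

let destroy (vdi : vdi) = Printf.printf "destroying %s\n" vdi

(* The five steps of SMAPIv1 mirror establishment, in order. *)
let establish_mirror ~dest (vdi : vdi) =
  let snap = snapshot vdi in                (* 1. snapshot *)
  let mirror = start_mirror vdi ~dest in    (* 2. mirror new writes *)
  let base = copy snap ~dest in             (* 3. copy existing data *)
  compose ~mirror ~base ;                   (* 4. compose into a single VDI *)
  destroy snap                              (* 5. clean up the local snapshot *)
```
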
More detail to come...

## SMAPIv3 migration

More detail to come...

## Error Handling

Storage migration is a long-running process, and is prone to failures at each step. Hence it is important to specify what errors could be raised at each step and their significance. This is beneficial both for the user and for triaging.

There are two general cleanup functions in SXM: `MIRROR.receive_cancel` and `MIRROR.stop`. The former is for cleaning up whatever has been created by `MIRROR.receive_start` on the destination host (such as VDIs for receiving mirrored data). The latter is a more comprehensive function that attempts to "undo" all the side effects that were done during the SXM, and also calls `receive_cancel` as part of its operations.

Currently, error handling is done by building up a list of cleanup functions in the `on_fail` list ref as the function executes. For example, if `receive_start` has completed successfully, `receive_cancel` is added to the list of cleanup functions, and whenever an exception is encountered, whatever has been added to the `on_fail` list ref is executed. This is convenient, but it entangles all the error handling logic with the core SXM logic itself, making the code rather hard to understand and maintain.

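A minimal sketch of that pattern, with `receive_start`, `receive_cancel` and the remaining steps passed in as placeholders rather than the real SMAPIv2 calls:

```ocaml
let migrate_with_on_fail ~receive_start ~receive_cancel ~remaining_steps =
  let on_fail : (unit -> unit) list ref = ref [] in
  try
    let received = receive_start () in
    (* receive_start succeeded, so register its cleanup *)
    on_fail := (fun () -> receive_cancel received) :: !on_fail ;
    (* ... every later step pushes its own cleanup the same way ... *)
    remaining_steps received
  with e ->
    (* run every cleanup registered so far, then re-raise *)
    List.iter (fun f -> f ()) !on_fail ;
    raise e
```
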
The idea to fix this is to introduce explicit "stages" during the SXM and to define explicitly what error handling should be done if a certain stage fails. This helps separate the error handling logic into the `with` part of a `try with` block, which is where it is supposed to be. Since we need to accommodate the existing SMAPIv1 migration (which has more stages than SMAPIv3), the following stages are introduced: preparation (v1, v3), snapshot (v1), mirror (v1, v3), copy (v1). Note that each stage also roughly corresponds to a helper function that is called within `MIRROR.start`, which is the wrapper function that initiates storage migration. Each helper function also has error handling logic within itself as needed (e.g. see `Storage_smapiv1_migrate.receive_start`) to deal with exceptions that happen within that helper function.

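A hedged sketch of how the stage-level handling could look; the stage functions and the `receive_cancel`/`stop` cleanups are placeholders for the real helpers called from `MIRROR.start`, not their actual signatures.

```ocaml
exception Stage_failed of string * exn

let mirror_start ~prepare ~snapshot ~mirror ~copy ~receive_cancel ~stop =
  (* preparation: nothing to clean up at this level if it fails *)
  (try prepare () with e -> raise (Stage_failed ("preparation", e))) ;
  (* snapshot + mirror: undo whatever preparation created on the destination *)
  ( try snapshot () ; mirror ()
    with e -> receive_cancel () ; raise (Stage_failed ("snapshot/mirror", e)) ) ;
  (* copy: by now most side effects exist, so use the full stop-style cleanup *)
  (try copy () with e -> stop () ; raise (Stage_failed ("copy", e)))
```
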
### Preparation (SMAPIv1 and SMAPIv3)

The preparation stage generally corresponds to what is done in `receive_start`, and this function will itself handle exceptions when there are partial failures within it, such as an exception after the receiving VDI is created. It will use the old-style `on_fail` function, but only with a limited scope.

There is nothing to be done at a higher level (i.e. within `MIRROR.start`, which calls `receive_start`) if preparation has failed.

### Snapshot and mirror failure (SMAPIv1)

For SMAPIv1, the mirror is established in a somewhat cumbersome way. The end goal is to set up connections between two tapdisk processes on the source and destination hosts. To achieve this goal, xapi does two main jobs: 1. create a connection between the two hosts and pass the connection to tapdisk; 2. create a snapshot as a starting point for the mirroring process.

Therefore the handling of failures at these two stages is similar: clean up what was done in the preparation stage by calling `receive_cancel`, and that is almost it. Again, we will leave whatever is needed for partial failure handling within those functions themselves and only clean up at a stage level in `storage_migrate.ml`.

Note that `receive_cancel` is a multiplexed function for SMAPIv1 and SMAPIv3, which means different cleanup logic will be executed depending on what type of SR we are migrating from.

### Mirror failure (SMAPIv3)

To be filled...

### Copy failure (SMAPIv1)

The final step of storage migration for SMAPIv1 is to copy the snapshot from the source to the destination. At this stage, most of the side-effectful work has been done, so we do need to call `MIRROR.stop` to clean things up if we experience a failure during copying.

## SMAPIv1 Migration implementation detail

```mermaid
sequenceDiagram
participant local_tapdisk as local tapdisk
participant local_smapiv2 as local SMAPIv2
end
Note over xapi: memory image migration by xenopsd
Note over xapi: destroy the VM record
```

### Receiving SXM

This is how xapi coordinates storage migration. We'll do it as a code walkthrough of the two layers: xapi and storage-in-xapi (SMAPIv2).

### Xapi code

The entry point is in [xapi_vm_migrate.ml](https://github.com/xapi-project/xen-api/blob/f75d51e7a3eff89d952330ec1a739df85a2895e2/ocaml/xapi/xapi_vm_migrate.ml#L786)

Finally we check for mirror failure in the task - this is set by the events thread watching for events from the storage layer, in [storage_access.ml](https://github.com/xapi-project/xen-api/blob/f75d51e7a3eff89d952330ec1a739df85a2895e2/ocaml/xapi/storage_access.ml#L1169-L1207)

### Storage code

The part of the code that is conceptually in the storage layer, but physically in xapi, is located in [storage_migrate.ml](https://github.com/xapi-project/xen-api/blob/f75d51e7a3eff89d952330ec1a739df85a2895e2/ocaml/xapi/storage_migrate.ml). There are logically a few separate parts to this file:

Let's start by considering the way the storage APIs are intended to be used.

#### Copying a VDI

`DATA.copy` takes several parameters:

The implementation tries to minimize the amount of data copied by looking for related VDIs on the destination SR. See below for more details.

#### Mirroring a VDI

`DATA.MIRROR.start` takes a similar set of parameters to that of copy:

Additionally the mirror can be cancelled using the `MIRROR.stop` API call.

#### Code walkthrough

Let's go through the implementation of `copy`:

##### DATA.copy

```ocaml
let copy ~task ~dbg ~sr ~vdi ~dp ~url ~dest =
```

The exception handler does nothing - so we leak remote VDIs if the exception happens after we've done our cloning :-(

#### DATA.copy_into
1510
+
#####DATA.copy_into
1300
1511
1301
1512
Let's now look at the data-copying part. This is common code shared between `VDI.copy`, `VDI.copy_into` and `MIRROR.start` and hence has some duplication of the calls made above.
1302
1513
@@ -1467,7 +1678,7 @@ The last thing we do is to set the local and remote content_id. The local set_co
1467
1678
Here we perform the list of cleanup operations. Theoretically. It seems we don't ever actually set this to anything, so this is dead code.