# Proposal: `Image Acceleration (Apparate)`

## Abstract

Provide an image acceleration mechanism and a sub-project to implement it.

## Background

Nowadays, a large number of workloads have been containerized, and different business scenarios have different requirements for container startup time. Offline computing and online services that need to quickly add computing resources (scaling groups) often require containers to start quickly. In the entire container startup cycle, pulling the image takes up 70% or more of the time. According to our statistics, because of the large container image of one offline computing service, each scale-out of thousands of Pods takes up to 10 minutes. Image distribution has become a major obstacle to rapid and elastic container scaling.

## Motivation

Currently, the Harbor community supports image acceleration by preheating images into third-party P2P components, but P2P still needs to pull the full image, which may waste a lot of traffic and bandwidth. Moreover, in large-scale P2P pull scenarios on public clouds, the large number of concurrent write IOs (docker extracting gzip layers) also puts a lot of pressure on cloud disks and requires special tuning. Therefore, we want to provide a sub-project that offers Harbor-native image acceleration.

## Goals

- Make the Harbor project support image acceleration natively and enhance Harbor's capabilities.
- Store the data and metadata of the image format in Harbor as OCI artifacts.
- Be compatible with both the original mode and the on-demand acceleration mode on the same node.

## Proposal

At present, the goharbor main project focuses on the R&D and maintenance of an enterprise-level container registry, and introduces image acceleration functions such as P2P preheating. However, this is not enough to meet users' need to accelerate image distribution. Image acceleration involves a large number of features and requirements, as well as relatively independent code development and maintenance work, and is not suitable for hosting in the goharbor main project repository.

Here, we suggest that an image acceleration subproject be created in the goharbor community to manage the project requirements, development progress, and engineering code.

In this section, we introduce Apparate, a new image acceleration mechanism.

### Overview

#### Glossary

- Apparate: the name of our image acceleration system.
- Apparate-builder: a client tool for building Apparate OCI artifacts and pushing them to Harbor or another data source.
- Apparate-snapshotter: a containerd snapshotter plugin for preparing the container rootfs.
- Apparate-fuse: a user-space read-only file system, which can read image data on demand.

Apparate introduces a new image format, compatible with the OCI standard, to implement on-demand image pulling. In order to overcome some of the shortcomings of the tar format, Apparate uses the squashfs format as the data storage format for image layers. We still retain the concept of a layer and use the layer as the minimum reuse unit.

Three components are needed to complete the entire process from image building to container running: apparate-builder, apparate-fuse, and apparate-snapshotter.

### Story

**Kubernetes cluster administrator (Blake)**

- To use Apparate to accelerate image distribution, Blake should install a special containerd snapshotter on every Kubernetes worker node.
- Blake should configure the network so that the snapshotter can access the backend storage or registry, such as Harbor or TCR.

**Developer/Ops (Frank)**

- To use Apparate, Frank should install its tool chain, such as `Apparate-builder` and `Apparate-snapshotter`.
- Frank should build a normal Docker image, and then convert it, layer by layer, to an `Apparate Special Artifact` via `Apparate-builder`.
- Frank can use `Apparate-builder` to convert an image and push it to the registry.

**CI/CD Pipeline**

- Can automatically build, convert, and push images.

**Image Registry**

- To store and distribute the `Apparate Special Artifact`, the registry should support OCI artifacts, like Harbor and TCR.

### Architecture

![arch.png](../images/image-acceleration/arch.png)

Apparate-builder can build an Apparate-accelerated image from local container storage, or convert any OCI-compliant image into one that supports Apparate acceleration, and then push it to Harbor.

In order to improve overall performance, we use a pre-overlay mechanism: the complete metadata view of the entire image is generated when the Apparate image is converted and built. When Apparate-fuse is mounted, the final form of the stacked image layers is obtained directly, so there is no need to perform overlay mounting.

Apparate-fuse is the core component that achieves on-demand image pulling. Apparate-snapshotter mounts Apparate-fuse when processing an Apparate-format artifact. Apparate-fuse loads the pre-processed metadata (the file inodes of each layer and the image layout) from the Apparate artifact and provides it as the rootfs to the starting container. When the container starts, requests for image data are translated by apparate-fuse into requests for individual data blocks, so the image can be downloaded on demand from the remote data source, which greatly improves container startup speed.

Apparate-snapshotter is a storage plugin of containerd. Based on the artifact type in the manifest, it can either create an on-demand rootfs for the container or create the container using the original local-storage method.

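As a minimal sketch (not the actual snapshotter implementation), the snippet below shows how the artifact type could be detected from a manifest in order to choose between the on-demand rootfs path and the normal local-storage path. The media type and annotation strings come from the Apparate manifest example later in this proposal; the Go struct names and the `isApparateArtifact` helper are assumptions made for illustration.

```go
package apparate

import "encoding/json"

// Media type and annotation taken from the Apparate manifest example in this
// proposal; the struct and helper names below are illustrative assumptions.
const (
    apparateBlobMediaType  = "application/vnd.oci.image.layer.apparate.blob.v1"
    apparateBlobAnnotation = "containerd.io/snapshot/apparate-blob"
)

type descriptor struct {
    MediaType   string            `json:"mediaType"`
    Digest      string            `json:"digest"`
    Size        int64             `json:"size"`
    Annotations map[string]string `json:"annotations,omitempty"`
}

type manifest struct {
    SchemaVersion int          `json:"schemaVersion"`
    Config        descriptor   `json:"config"`
    Layers        []descriptor `json:"layers"`
}

// isApparateArtifact reports whether any layer in the manifest is an Apparate
// blob layer; if so, the snapshotter would mount Apparate-fuse instead of
// unpacking tar.gz layers locally.
func isApparateArtifact(rawManifest []byte) (bool, error) {
    var m manifest
    if err := json.Unmarshal(rawManifest, &m); err != nil {
        return false, err
    }
    for _, l := range m.Layers {
        if l.MediaType == apparateBlobMediaType || l.Annotations[apparateBlobAnnotation] == "true" {
            return true, nil
        }
    }
    return false, nil
}
```
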
### Key features

1. Multiple data sources: registry, S3, and file system
2. Local file system cache
3. OCI image spec compatibility
4. High reliability: recover the fuse process under a running container rootfs
5. Data block prefetch support
6. Pre-merged metadata overlay
7. Data block checksums
8. Integration with P2P systems

### Workflow

![workflow1.png](../images/image-acceleration/workflow.png)

1. Containerd uses apparate-snapshotter to execute the pod creation or image pull command.

2. Apparate-snapshotter uses apparate-fuse to construct the container rootfs.

3. Apparate-fuse pulls the merge-meta layer, in which the superblock is stored. Apparate-fuse then knows the mapping between files and the different layers and blocks.

4. Apparate supports Harbor/registry, object storage, and the local POSIX file system as the backend data storage layer.

5. Containerd creates the container (pod) on the virtual read-only rootfs provided by apparate-fuse, and uses overlayfs to mount a read/write upper layer for the container.

6. When the container has started, apparate-fuse, as a file system, receives the actual read requests, checks whether the corresponding data block is cached locally, and returns data from the local cache first. If the data block has not been cached, it reads the block from the data source (a sketch of this read path follows below).

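The read path of step 6 could look roughly like the sketch below. The `blockCache` and `dataSource` interfaces and the `readBlock` helper are hypothetical names; the sketch only illustrates the cache-first, fetch-on-miss behaviour described above.

```go
package apparate

import "fmt"

// blockCache and dataSource are hypothetical interfaces used only to
// illustrate the cache-first read path of workflow step 6.
type blockCache interface {
    Get(blobID string, blockIdx uint32) ([]byte, bool)
    Put(blobID string, blockIdx uint32, data []byte)
}

type dataSource interface {
    // ReadBlock fetches one block from the backend (registry, S3 or local
    // file system) and returns it uncompressed.
    ReadBlock(blobID string, blockIdx uint32) ([]byte, error)
}

// readBlock serves a block from the local cache when possible and falls back
// to the remote data source on a cache miss, caching the result.
func readBlock(cache blockCache, src dataSource, blobID string, blockIdx uint32) ([]byte, error) {
    if data, ok := cache.Get(blobID, blockIdx); ok {
        return data, nil
    }
    data, err := src.ReadBlock(blobID, blockIdx)
    if err != nil {
        return nil, fmt.Errorf("read block %d of blob %s: %w", blockIdx, blobID, err)
    }
    cache.Put(blobID, blockIdx, data)
    return data, nil
}
```
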
### Design

#### Layout

```
+-----------+----------+-----------+---------------+--------------+-------------+-----------+
|           |          |           |               |              |             |           |
|SuperBlock |BlobTable |InodeTable |DirectoryTable |FragmentTable |UID/GIDTable |XattrTable |
|           |          |           |               |  (optional)  |             |           |
+-----------+----------+-----------+---------------+--------------+-------------+-----------+
```

- SuperBlock, the first section of an apparate fs archive; it contains important information about the archive, including the locations of the other sections.
- BlobTable, records a list of the blob ids of all layers that the archive contains.
- InodeTable, contains all inodes.
- DirectoryTable, records the info of all inodes in a directory.
- FragmentTable, records the location and size of fragment blocks. If the size of a file is not equally divisible by block_size, the final chunk can be stored in a fragment.
- UID/GIDTable, records uids and gids.
- XattrTable, records the extended-attribute key/value pairs attached to inodes.
- In Apparate, blob data is stored in a number of data blocks, which are stored sequentially. All data blocks must be of a common size and are then compressed (a sketch of this chunking step follows after this list).

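As a rough illustration of the last point, the sketch below splits a layer blob into equally sized data blocks and compresses each block independently. The `chunkAndCompress` helper, the 1 MiB block size, and the choice of gzip are assumptions, not part of the format.

```go
package apparate

import (
    "bytes"
    "compress/gzip"
)

const blockSize = 1 << 20 // 1 MiB; all data blocks share a common size

// chunkAndCompress splits raw blob data into blockSize chunks and compresses
// each chunk individually. The final chunk may be shorter than blockSize; in
// the real format it could instead be stored in a fragment block.
func chunkAndCompress(blob []byte) ([][]byte, error) {
    var blocks [][]byte
    for off := 0; off < len(blob); off += blockSize {
        end := off + blockSize
        if end > len(blob) {
            end = len(blob)
        }
        var buf bytes.Buffer
        zw := gzip.NewWriter(&buf)
        if _, err := zw.Write(blob[off:end]); err != nil {
            return nil, err
        }
        if err := zw.Close(); err != nil {
            return nil, err
        }
        blocks = append(blocks, buf.Bytes())
    }
    return blocks, nil
}
```
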
#### Superblock

```go
type SuperBlock struct {
    Magic uint32
    InodeCount uint32
    BlockSize uint32
    FragmentEntryCount uint32
    CompressionId uint16
    Flags uint16
    IdCount uint16
    RootInodeRef uint64
    // Total size of the file system in bytes, not counting padding.
    BytesUsed uint64
    IdTableStart uint64
    XattrIdTableStart uint64
    InodeTableStart uint64
    DirectoryTableStart uint64
    FragmentTableStart uint64
    BlobTableStart uint64
}
```

- CompressionId selects the compression algorithm: 1 - GZIP; 3 - LZO; 5 - LZ4.
- InodeTableStart is the byte offset at which the inode table starts.
- RootInodeRef is a reference to the inode of the root directory of the archive. A sketch of decoding the superblock follows below.

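A possible way to read these fields from the head of the merge-meta layer with Go's `encoding/binary` is sketched below; the little-endian byte order, the magic value, and the `readSuperBlock` helper name are assumptions for illustration.

```go
package apparate

import (
    "encoding/binary"
    "fmt"
    "io"
)

// superBlockMagic is an assumed magic value used only for this sketch.
const superBlockMagic = 0x73717368

// readSuperBlock decodes the SuperBlock defined above from the beginning of
// the merge-meta layer data.
func readSuperBlock(r io.Reader) (*SuperBlock, error) {
    var sb SuperBlock
    if err := binary.Read(r, binary.LittleEndian, &sb); err != nil {
        return nil, fmt.Errorf("decode superblock: %w", err)
    }
    if sb.Magic != superBlockMagic {
        return nil, fmt.Errorf("bad superblock magic 0x%x", sb.Magic)
    }
    return &sb, nil
}
```
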
#### Inode table

The inode table starts at inode_table_start and ends at directory_table_start, containing all inodes. Each entry in the table has a common inode header followed by the concrete inode type struct.

##### Inode header and inode

```go
type InodeHeader struct {
    Type uint16
    Mode uint64
    UidIdx uint16
    GidIdx uint16
    // Blob index in blob table
    BlobIdx uint16
    // The position of this inode in the full list of inodes
    InodeNumber uint16
}

type Inode struct {
    Header *InodeHeader
    // inode type struct
    Data interface{}
    Extra []uint32
}
```

- Type, the type of item described by the inode data that follows this header.
- BlobIdx, the blob index in the blob table.
- Extra in Inode: for regular file inodes, this is an array of compressed block sizes. For symlink inodes, it is actually a string holding the target.

##### Inode type

The data in an inode could be a file, a directory, and so on. The types are shown below.

| type | description |
| ------ | ------ |
| 1 | basic directory |
| 2 | basic file |
| 3 | basic symlink |
| 4 | basic block device |
| 5 | basic char device |
| 6 | basic fifo |
| 7 | basic socket |

##### basic file inode

```go
type InodeBasicFile struct {
    BlockStart uint32
    FragmentIndex uint32
    FragmentOffset uint32
    FileSize uint32
}
```

- BlockStart, absolute position of the first compressed data block.
- FragmentIndex, index into the fragment table.
- FragmentOffset, offset into the uncompressed fragment block.
- FileSize, uncompressed size of the file in bytes (see the block-lookup sketch after this list).

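To make the on-demand read path concrete, the sketch below maps a read offset inside a regular file to the data block (or trailing fragment) that holds it, using the fields above together with the SuperBlock's BlockSize. The `locateBlock` helper and the `0xffffffff` "no fragment" sentinel are assumptions.

```go
package apparate

// locateBlock maps a read offset inside a regular file to the data block (or
// trailing fragment) holding it. The helper name, the error handling and the
// 0xffffffff "no fragment" sentinel are illustrative assumptions.
func locateBlock(f InodeBasicFile, blockSize uint32, offset uint32) (blockIdx uint32, inFragment bool) {
    if offset >= f.FileSize {
        return 0, false // out of range; a real implementation would return an error
    }
    fullBlocks := f.FileSize / blockSize
    blockIdx = offset / blockSize
    // The final, not block-aligned chunk may be stored in a fragment block.
    if f.FragmentIndex != 0xffffffff && f.FileSize%blockSize != 0 && blockIdx == fullBlocks {
        return 0, true
    }
    return blockIdx, false
}
```
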
##### basic directory inode

```go
type InodeDirectory struct {
    StartBlock uint32
    Nlink uint32
    Size uint16
    Offset uint16
    ParentNode uint32
}
```

- StartBlock, offset from the directory table start to the location of the first directory header.
- Nlink, number of hard links to this node.
- Size, combined size of all directory entries and headers in bytes.
- Offset, offset into the uncompressed start block where the header can be found.
- ParentNode, inode number of the parent directory.

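Given the basic file and basic directory structs above, the Type field from the inode header could be used to pick the concrete struct to decode, as in the sketch below; the helper name and the little-endian byte order are assumptions.

```go
package apparate

import (
    "encoding/binary"
    "fmt"
    "io"
)

// decodeInodeData selects the concrete inode struct to decode based on the
// Type field from the inode header. Only the two most common types are shown.
func decodeInodeData(r io.Reader, h *InodeHeader) (interface{}, error) {
    switch h.Type {
    case 1: // basic directory
        var d InodeDirectory
        if err := binary.Read(r, binary.LittleEndian, &d); err != nil {
            return nil, err
        }
        return &d, nil
    case 2: // basic file
        var f InodeBasicFile
        if err := binary.Read(r, binary.LittleEndian, &f); err != nil {
            return nil, err
        }
        return &f, nil
    default:
        return nil, fmt.Errorf("inode type %d not handled in this sketch", h.Type)
    }
}
```
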
#### Directory table

For each directory inode, the directory table stores a list of all entries contained in that directory, with references back to the inodes that describe those entries. Each table entry consists of a header followed by an entry list.

```
+----------------+------+------+------+----------------+------+------+------+
|                |      |      |      |                |      |      |      |
|DirectoryHeader |Entry |Entry | ...  |DirectoryHeader |Entry |Entry | ...  |
|                |      |      |      |                |      |      |      |
+----------------+------+------+------+----------------+------+------+------+
```

##### Directory Header

```go
type DirHeader struct {
    Count uint32
    StartBlock uint32
    InodeNumber uint32
}
```

- Count, the number of entries that follow.
- StartBlock, the location of the inodes for the entries that follow, relative to the start of the inode table.
- InodeNumber, the inode number of the first entry.

##### Entry

```go
type DirectoryEntry struct {
    Offset uint16
    InodeDiff int16
    Type uint16
    Size uint16
    Name []uint8
}
```

- Offset, offset into the uncompressed metadata block containing the corresponding inode.
- InodeDiff, difference of the entry's inode number from the one in the header (see the lookup sketch after this list).
- Size, the size of the entry name.
- Name, the name of the directory entry.

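The sketch below shows how an entry's inode number could be recovered from the header's base inode number plus the entry's signed difference, and how a name lookup over one header's entries might look. The `entryInodeNumber` and `lookupEntry` helpers are assumed names.

```go
package apparate

// entryInodeNumber recovers an entry's inode number from the header's base
// inode number plus the entry's signed difference.
func entryInodeNumber(h DirHeader, e DirectoryEntry) uint32 {
    return uint32(int64(h.InodeNumber) + int64(e.InodeDiff))
}

// lookupEntry scans the entries that follow one directory header for a name
// and returns the matching entry's inode number.
func lookupEntry(h DirHeader, entries []DirectoryEntry, name string) (uint32, bool) {
    for _, e := range entries {
        if string(e.Name) == name {
            return entryInodeNumber(h, e), true
        }
    }
    return 0, false
}
```
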
#### Blob Table

The blob table is the mapping from an inode's blob index (blob_idx) to a blob id (see the lookup sketch below).

```
+-------+-------+-------+-------+
|       |       |       |       |
|BlobId |BlobId |BlobId | ...   |
|       |       |       |       |
+-------+-------+-------+-------+
```

```go
type BlobTable struct {
    BlobIds []uint32
}
```

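A small sketch of that mapping, resolving an inode's BlobIdx against the blob table; the `blobIDFor` helper is an assumed name.

```go
package apparate

import "fmt"

// blobIDFor resolves which layer blob an inode's data lives in by indexing
// the blob table with the header's BlobIdx.
func blobIDFor(t BlobTable, h InodeHeader) (uint32, error) {
    if int(h.BlobIdx) >= len(t.BlobIds) {
        return 0, fmt.Errorf("blob index %d out of range (table has %d blobs)", h.BlobIdx, len(t.BlobIds))
    }
    return t.BlobIds[h.BlobIdx], nil
}
```
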
#### Fragment Table

```go
type FragmentEntry struct {
    StartOffset uint64
    Size uint32
}
```

- StartOffset, location of the fragment data block on disk.
- Size, size of the fragment in bytes.

#### Artifact Format

A typical Apparate OCI image manifest consists of:

- a config.json
- one Apparate metadata layer ("mediaType": "application/vnd.oci.image.layer.v1.tar.gz")
- one or more Apparate blob layers ("mediaType": "application/vnd.oci.image.layer.apparate.blob.v1").

```json
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
  "config": {
    "mediaType": "application/vnd.docker.container.image.v1+json",
    "size": 1507,
    "digest": "sha256:f70734b6a266dcb5f44c383274821207885b549b75c8e119404917a61335981a"
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.apparate.blob.v1",
      "size": 2813316,
      "digest": "sha256:cbdbe7a5bc2a134ca8ec91be58565ec07d037386d1f1d8385412d224deafca08",
      "annotations": {
        "containerd.io/snapshot/apparate-blob": "true"
      }
    },
    {
      "mediaType": "application/vnd.oci.image.layer.apparate.blob.v1",
      "size": 351664,
      "digest": "sha256:014436560cf54889d20b22267e763a95012094c05bab16dd5af8119df9f2b00b",
      "annotations": {
        "containerd.io/snapshot/apparate-blob": "true"
      }
    },
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar.gz",
      "size": 5930,
      "digest": "sha256:6c098fc9cc8bc5c0acfb5e34ca1ca7c522a8cd7a90dd8464459d46b8a8aa3ff3",
      "annotations": {
        "containerd.io/snapshot/apparate-merge": "true"
      }
    }
  ]
}
```

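Following the manifest above, the sketch below shows how a client could pick out the merge-meta layer (workflow step 3) by its annotation before fetching any blob layers. It reuses the `descriptor` type from the earlier architecture sketch; the annotation key comes from the manifest example and the helper name is assumed.

```go
package apparate

import "errors"

// Annotation key taken from the manifest example above.
const apparateMergeAnnotation = "containerd.io/snapshot/apparate-merge"

// findMergeMetaLayer returns the layer descriptor that carries the superblock
// and metadata tables, which apparate-fuse downloads before any data blocks.
func findMergeMetaLayer(layers []descriptor) (descriptor, error) {
    for _, l := range layers {
        if l.Annotations[apparateMergeAnnotation] == "true" {
            return l, nil
        }
    }
    return descriptor{}, errors.New("no merge-meta layer found in manifest")
}
```
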
## Compatibility

Apparate is compatible with the OCI image spec and distribution spec, so it can be pushed to Harbor.