docs/chunk-deduplication.md (45 additions, 31 deletions)
@@ -1,8 +1,8 @@
1
-
# Notice [WIP] Pending further revisions
2
-
# Problem introduction
1
+
# Problem introduction
3
2
Container images often contain a large number of duplicate files or content, and this duplication wastes a large amount of storage space, especially in high-density deployment scenarios. As the number of Nydus images grows, it causes problems such as low storage utilization and excessive bandwidth consumption. An effective deduplication mechanism is therefore needed.
4
3
5
4
Unlike traditional OCI, which distributes images at layer granularity, the smallest unit of a Nydus image is a chunk, so deduplication has to operate on chunks. We also want to deduplicate at more than one level: between different Nydus images and between different versions of the same Nydus image. In both cases the approach is the same: only one copy of each repeated chunk is retained, and every other occurrence is replaced by a reference to that copy. This reduces storage usage, makes better use of Nydus's data transmission and storage capabilities, and improves the speed and efficiency of image access.
5
+
6
6
# General idea
7
7
The deduplication algorithm first selects the duplicate chunks in the images based on image information such as the number of occurrences of each chunk, the chunk size, the image the chunk belongs to, and the corresponding image version. From these it generates a chunkdict, which records the unique identifier (fingerprint) of each selected chunk. Only the chunkdict needs to be stored; other images can then refer to the chunks in the chunkdict by reference.
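The idea can be illustrated with a minimal Python sketch. It is not the Nydus implementation: the function names, the toy digests, and the selection rule (keep any chunk that appears in at least two images) are assumptions chosen only to show how a dictionary of fingerprints lets an image reference chunks instead of storing them.

```python
from collections import Counter


def build_chunkdict(images, min_occurrences=2):
    """Return the digests that appear in at least `min_occurrences` images.

    `images` maps an image name to its list of chunk digests. The threshold
    is a hypothetical selection rule, used here for illustration only.
    """
    counts = Counter()
    for chunks in images.values():
        counts.update(set(chunks))  # count each digest once per image
    return {digest for digest, n in counts.items() if n >= min_occurrences}


def split_chunks(chunks, chunkdict):
    """Split an image's chunks into dictionary references and stored chunks."""
    referenced = [c for c in chunks if c in chunkdict]
    stored = [c for c in chunks if c not in chunkdict]
    return referenced, stored


# Toy data: the digests are placeholders, not real chunk fingerprints.
images = {
    "redis:7.0.1": ["a1", "b2", "c3"],
    "redis:7.0.2": ["a1", "b2", "d4"],
    "alpine:3.18": ["a1", "e5"],
}
chunkdict = build_chunkdict(images)
referenced, stored = split_chunks(images["redis:7.0.2"], chunkdict)
```

In this toy example, `redis:7.0.2` only needs to store the chunks in `stored`; the chunks in `referenced` are resolved through the dictionary.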
8
8
@@ -13,32 +13,43 @@ The deduplication algorithm is divided into two parts, the first part is the DBS
13
13
2. Extract the image information and call the DBSCAN clustering algorithm to deduplicate different images.
14
14
3. Deduplicate the dictionary content produced in step 2, then call the exponential smoothing algorithm on each image separately to deduplicate across image versions.
15
15
4. Get the deduplication dictionary generated by running the two algorithms and write it to disk.
16
+
5. Generate a chunkdict image and push it to the remote repository.
# Use the chunk dict image to reduce the incremental size of the new image
33
+
```
nydusify convert \
  --source registry.com/redis:OCI_7.0.4 \
  --target registry.com/redis:nydus_7.0.4 \
  --chunk-dict registry.com/redis:nydus_chunkdict
```
25
-
***
26
-
`nydusify chunkdict generate` calls two commands `nydus-image chunkdict save` and `nydus-image chunkdict generate` to store image information into the database and generate a list of chunks to be deduplicated
27
39
28
-
Download multiple Nydus images in advance and put them into the repository as a dataset, for example 10 consecutive versions each of redis and alpine, then run the command `nydus-image chunkdict save` to store the chunk and blob information in the chunk and blob tables of the database.
40
+
***
41
+
`nydusify chunkdict generate` calls the subcommand `nydus-image chunkdict generate` to store image information in the database and generate a new bootstrap, which serves as the chunkdict bootstrap.
29
42
43
+
Download multiple Nydus images in advance and put them into the repository as a dataset, for example 10 consecutive versions each of redis and alpine, then run the command `nydus-image chunkdict generate` to store the chunk and blob information in the chunk and blob tables of the database.
@@ -77,10 +88,9 @@ where $C(R_x)$ represents the unique chunk set of all training set images in the
77
88
**6.** Remove the chunks in the chunk dictionary selected in step 5 from all images (training set and test set), then repeat steps 1-5 to generate the next chunk dictionary. Stop when the maximum number of cycles (7) is reached, or when the proportion of discrete images exceeds 80% of the total number of images.
78
89
79
90
How the DBSCAN algorithm divides points into clusters is shown in the diagram:
In this diagram, minPts = 4. Point A and the other red points are core points, because the area surrounding each of these points within an ε radius contains at least 4 points (including the point itself). Because they are all reachable from one another, they form a single cluster. Points B and C are not core points, but are reachable from A (via other core points) and thus belong to the cluster as well. Point N is a noise point that is neither a core point nor directly reachable.
**Remark:** The picture and the associated description of the DBSCAN algorithm in this section are taken from: [https://en.wikipedia.org/wiki/DBSCAN](https://en.wikipedia.org/wiki/DBSCAN)
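The core/border/noise distinction described above can be sketched in plain Python. This is a generic textbook DBSCAN over 2D points with Euclidean distance, written for illustration; the distance measure actually used between images is not shown in this excerpt, so the points, `eps`, and the function names here are assumptions.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within `eps` of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or NOISE if it joins no cluster."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster  # i is a core point: start a new cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is also a core point: expand
                queue.extend(j_neighbors)
        cluster += 1
    return labels


# Toy data: five nearby points and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10)]
labels = dbscan(points, eps=1.5, min_pts=4)
```

Here `(2, 1)` has only three points within `eps`, so it is not a core point, but it is reachable from the core point `(1, 1)` and joins the cluster as a border point, like B and C in the diagram. The isolated point `(10, 10)` is labeled as noise, like N.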
93
+
84
94
#### Algorithm 2 Deduplication between different versions of the image (exponential smoothing algorithm)
85
95
***
86
96
**Basic principle:** Exponential smoothing is a method for predicting and smoothing time series data. It computes a weighted average of the data, giving higher weight to more recently repeated chunks and continuously updating the smoothed value, so newer chunks have a greater influence on future predictions while the influence of older data gradually weakens.
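As a rough illustration of this weighting, the sketch below applies the standard simple exponential smoothing recurrence $S_t = \alpha Y_{t-1} + (1-\alpha) S_{t-1}$ to a chunk's appearance history (1 if the chunk appeared in a version, 0 otherwise). The recurrence form, the initial value of 0, and the function name are assumptions for illustration; only $\alpha = 0.5$ is taken from this document.

```python
def smoothed_score(appearances, alpha=0.5, initial=0.0):
    """Exponentially smoothed score of a chunk's appearance history.

    `appearances` is ordered from the oldest version to the newest and holds
    1 where the chunk appeared and 0 where it did not. The starting value
    `initial` is an assumption of this sketch.
    """
    score = initial
    for y in appearances:
        score = alpha * y + (1 - alpha) * score
    return score


# Two chunks that each appear in two of four versions.
recent = smoothed_score([0, 0, 1, 1])  # appears only in the newest versions
old = smoothed_score([1, 1, 0, 0])     # appears only in the oldest versions
```

Both chunks appear the same number of times, but the chunk seen in the newest versions ends with the higher score, which is the behavior the basic principle describes.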
@@ -102,16 +112,20 @@ where, $\alpha=0.5$ , $Y_{t-1}$ indicates whether the chunk appeared in the prev
102
112
103
113
**5.** Choose a chunk dictionary that minimizes the test set's storage space.
104
114
***
115
+
116
+
105
117
### Exponential smoothing algorithm test table
118
+
Step 1: Download 10 OCI versions of an image and count the total size
119
+
Step 2: Convert the OCI images to Nydus images, and then count the total size after conversion
120
+
Step 3: Select three versions of the image to generate chunkdict, use chunkdict to convert the remaining seven versions of the image, and then count the total size
121
+
deduplication rate = (total_image_size(nydus) - total_image_size(nydus after deduplicating)) / total_image_size(nydus)
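The rate can be computed as in the short sketch below. The function name and the sizes are hypothetical and are not measured results.

```python
def deduplication_rate(nydus_size, nydus_size_after_dedup):
    """Fraction of the Nydus image size saved by converting with a chunkdict."""
    if nydus_size <= 0:
        raise ValueError("nydus_size must be positive")
    return (nydus_size - nydus_size_after_dedup) / nydus_size


# Hypothetical sizes in MB, for illustration only.
rate = deduplication_rate(1000, 750)
```

With these example sizes, a quarter of the original Nydus size is saved.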
122
+
123
+
124
+
125
+
| image_name | version number | total_image_size(OCI) | total_image_size(nydus) | total_image_size(nydus after deduplicating) | chunkdict_image_size | deduplication rate |