Add support for memory pages compression #2895
Open
rst0git wants to merge 18 commits into
avagin
reviewed
Feb 18, 2026
Pull request overview
This PR adds optional LZ4-based compression for memory pages images (pages.img) in CRIU, recording per-page compressed sizes in the pagemap image so restore can locate and decompress pages correctly (including for streaming, pre-dump chains, and page-server flows).
Changes:
- Introduces `--compress`/RPC support and persists the setting in inventory images.
- Extends pagemap images with `compressed_size[]` and `total_compressed_size`, and updates dump/restore page I/O paths (including a helper daemon for PIE restore).
- Updates ZDTM and CI scripts to exercise compressed dumps/restores.
Reviewed changes
Copilot reviewed 33 out of 33 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| criu/compression.c | New LZ4 compression/decompression + helper daemon for restore-time decompression. |
| criu/include/compression.h | Public compression API and page-size bound macro. |
| criu/include/cr_options.h | Adds pages_compression option. |
| criu/config.c | Adds -c/--compress option parsing. |
| criu/cr-service.c | Wires RPC option to enable compression. |
| criu/crtools.c | Adds CLI help text for --compress (under CONFIG_LZ4). |
| criu/page-xfer.c / criu/include/page-xfer.h | Implements compressed write path (local + page-server receive side buffering). |
| criu/pagemap.c / criu/include/pagemap.h | Implements compressed read paths (local + streaming) and carries compressed metadata into restorer args. |
| criu/mem.c | Starts helper daemon and passes pipe fds to restorer. |
| criu/pie/restorer.c / criu/include/restorer.h | Adds compressed restore path via pipe protocol to helper daemon. |
| criu/cr-restore.c | Fixes up restorer pointers for compressed_size arrays. |
| criu/image.c | Persists compression setting in inventory.img and enables it on restore when present. |
| images/pagemap.proto | Adds compressed_size[] and total_compressed_size. |
| images/inventory.proto | Adds pages_compression to inventory entry. |
| images/rpc.proto | Adds RPC compress boolean option. |
| criu/unittest/unit.c / criu/Makefile* | Adds unit test coverage and build integration for compression module. |
| test/zdtm.py / scripts/ci/run-ci-tests.sh | Adds --compress wiring and CI test runs. |
| Makefile.config / dependency scripts | Adds LZ4 feature detection and distro package dependencies. |
| Documentation/criu.txt | Documents --compress. |
| contrib/criu-compression-benchmark.py | Adds benchmarking script for compression impact. |
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@             Coverage Diff              @@
##           criu-dev    #2895      +/-   ##
============================================
- Coverage     57.04%   56.73%   -0.31%
============================================
  Files           154      156       +2
  Lines         40534    41796    +1262
  Branches       8882     9175     +293
============================================
+ Hits          23123    23715     +592
- Misses        17057    17727     +670
  Partials        354      354
Add build system plumbing for LZ4 compression. When liblz4 is found
via pkg-config, CONFIG_LZ4 is defined and the library is linked.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Add the protobuf fields used to encode memory page compression both in
images and on the wire.
- (inventory) uint32 compress: compression mode for the dump, encoded
  with enum compress_mode values: 0 = off, 1 = per-page, 2 = region.
  Lets the restore side detect and reproduce the compression encoding
  automatically.
- (pagemap) repeated uint32 compressed_size: per-block compressed size
  array. Each value is the number of bytes the compressed block occupies
  in the pages image. In per-page mode each block is one page; in region
  mode each block covers up to region_pages consecutive pages.
  Sentinel values:
    0             = all-zero block (no payload is stored),
    block bytes   = stored raw (no decompression needed),
    anything else = LZ4-compressed block of that size.
- (pagemap) uint64 total_compressed_size: sum of compressed_size[].
  Used to size the read in one pread(); uint64 is needed because a
  single pagemap entry can cover millions of pages and the sum can
  exceed 4 GiB.
- (pagemap) uint32 region_pages: number of pages per compressed block
  in region mode. Absent or 0 means per-page compression.
- (rpc) uint32 compress: same encoding as the inventory field.
- (rpc) uint32 compress_acceleration: LZ4 acceleration value.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
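The sentinel encoding of `compressed_size[]` described above can be illustrated with a small Python sketch. This is not CRIU code: the names `BlockKind`, `classify_block` and `total_compressed_size` are hypothetical, and the 4 KiB page size is an assumption taken from the per-page mode description.

```python
from enum import Enum

PAGE_SIZE = 4096  # assumption: 4 KiB pages


class BlockKind(Enum):
    ZERO = "zero"  # no payload stored in the pages image
    RAW = "raw"    # payload stored verbatim, no decompression needed
    LZ4 = "lz4"    # payload is an LZ4-compressed block of that size


def classify_block(compressed_size: int, block_pages: int = 1) -> BlockKind:
    """Classify one compressed_size[] value for a block of block_pages pages."""
    block_bytes = block_pages * PAGE_SIZE
    if compressed_size == 0:
        return BlockKind.ZERO
    if compressed_size == block_bytes:
        return BlockKind.RAW
    return BlockKind.LZ4


def total_compressed_size(sizes: list[int]) -> int:
    """Sum of compressed_size[]; Python ints do not overflow, a C reader needs uint64."""
    return sum(sizes)
```

The same value can mean different things depending on the block size: 4096 marks a raw block in per-page mode but an LZ4 block when the block covers 64 pages. A pagemap entry with just over a million raw pages already pushes the sum past what a uint32 can hold.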
Add compression.h with public helpers used by the dump and restore paths.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Add CLI options to enable memory page compression and the
corresponding feature check used to gate the LZ4 build.
-c, --compress enable per-page LZ4 compression
--compress-region SIZE enable region LZ4 compression with
the given region size; SIZE accepts
K/M/G suffixes (e.g. 256K, 1M)
--compress-acceleration N LZ4 acceleration; implies --compress
if no other mode is set
criu check --feature compress
The selected mode is stored in opts.compress_mode (enum compress_mode
value) and persisted in the inventory image so that the restore
side detects the encoding automatically. When CRIU is built without
CONFIG_LZ4, the option is rejected early in check_options() with a
clear error message. --compress-region is also rejected when used
with --page-server or --stream, because those wire formats are
per-page only.
The RPC interface accepts the same options via the compress,
compress_acceleration and compress_region_size fields.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
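As a rough illustration of the `SIZE` argument with K/M/G suffixes, the following Python sketch parses a size string and converts it to a page count. It is not the parser in criu/config.c: `parse_region_size` and `region_pages_for` are hypothetical names, the 4 KiB page size is assumed, the 4M upper bound is taken from the pull request description, and rejecting sizes that are not a whole number of pages is an assumption.

```python
PAGE_SIZE = 4096                     # assumption: 4 KiB pages
MAX_REGION_BYTES = 4 * 1024 * 1024   # "max 4M" per the PR description

_SUFFIXES = {"K": 1024, "M": 1024**2, "G": 1024**3}


def parse_region_size(text: str) -> int:
    """Parse a size such as '256K' or '1M' into bytes."""
    text = text.strip().upper()
    multiplier = 1
    if text and text[-1] in _SUFFIXES:
        multiplier = _SUFFIXES[text[-1]]
        text = text[:-1]
    if not (text.isascii() and text.isdigit()):
        raise ValueError("invalid size")
    return int(text) * multiplier


def region_pages_for(text: str) -> int:
    """Convert a SIZE argument to a page count, applying the assumed limits."""
    size = parse_region_size(text)
    if size == 0 or size % PAGE_SIZE or size > MAX_REGION_BYTES:
        raise ValueError("region size must be a page multiple up to 4M")
    return size // PAGE_SIZE
```

With these assumptions, `256K` maps to 64 pages per region and `1M` to 256 pages.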
Add the local-image and page-server write paths for memory page
compression.

write_pagemap_loc_compressed() and write_pages_loc_compressed() buffer
per-block compressed sizes into pending_pe and flush a PagemapEntry
once all blocks of an iovec have been compressed. The loop body is
parameterised on pending_pe.region_pages: when 0, each page is
compressed independently; when non-zero, pages are accumulated into
regions of region_pages and compressed as a single LZ4 block. Zero
pages and zero regions are stored with compressed_size=0 (no image
payload); blocks that do not compress below the 7/8 store-raw
threshold are written verbatim.

For the page server, add PS_IOV_ADD_F_COMPRESSED and
write_pages_to_server_compressed(): pages are compressed before being
sent over the network and the receiver writes the compressed bytes to
the local image without re-compressing.

write_fd_full() handles short writes on the pages image.
close_page_xfer() frees pending_pe.compressed_size on error paths; it
is initialised to NULL so the unused-branch close is a no-op.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
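The three outcomes of the write path (zero block, raw fallback, compressed block) can be sketched in Python as follows. This is a simplified model, not the code in page-xfer.c: `zlib` is used purely as a stand-in for LZ4 so that the example runs with the standard library, the 7/8 threshold is assumed to be relative to the uncompressed block size, and `encode_block`, `encode_pages` and `sample_pages` are hypothetical names.

```python
import hashlib
import zlib

PAGE_SIZE = 4096  # assumption: 4 KiB pages


def encode_block(block: bytes) -> tuple[int, bytes]:
    """Return (compressed_size, payload) for one block."""
    if not any(block):
        return 0, b""                    # all-zero block: no payload
    packed = zlib.compress(block, 1)     # stand-in for an LZ4 call
    if len(packed) >= len(block) * 7 // 8:
        return len(block), block         # store raw (assumed 7/8 rule)
    return len(packed), packed


def encode_pages(data: bytes, region_pages: int = 0) -> tuple[list[int], bytes]:
    """Encode data per page (region_pages == 0) or per region."""
    step = PAGE_SIZE * (region_pages or 1)
    sizes, payload = [], bytearray()
    for off in range(0, len(data), step):
        size, chunk = encode_block(data[off:off + step])
        sizes.append(size)
        payload += chunk
    return sizes, bytes(payload)


def sample_pages() -> bytes:
    """One zero page, one repeating-pattern page, one high-entropy page."""
    zero = bytes(PAGE_SIZE)
    pattern = b"ab" * (PAGE_SIZE // 2)
    noisy = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(128))
    return zero + pattern + noisy
```

Calling `encode_pages(sample_pages())` yields one size per page: 0 for the zero page, a small value for the pattern page, and 4096 for the high-entropy page, which falls back to raw storage. The payload length always equals the sum of the sizes, which is the role `total_compressed_size` plays in the pagemap.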
Add comments to the page_read function pointers and data fields.
No functional changes.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
The PIE restorer cannot link against LZ4, so a helper daemon process
handles decompression. The daemon is forked in prepare_vma_ios() and
communicates with the restorer over a pair of pipes.

Wire protocol header (struct pipe_hdr):
    pid_t    remote_pid;
    off_t    offs;                  /* file offset in pages.img */
    uint64_t total_compressed_size;
    int      n_pages;               /* total pages in request */
    int      nr_iovs;               /* number of destination iovecs */
    int      n_blocks;              /* count of compressed_size[] */
    uint32_t region_pages;          /* 0 = per-page, >0 = region */

After the header come compressed_size[n_blocks]; in region mode the
daemon then reads block_pages[n_blocks] (uint16 per block) giving each
block's actual page count (the last block of an entry may be shorter
than region_pages). The remote-destination iovs[nr_iovs] follow last.

The daemon reads compressed data with a single pread() per request,
decompresses block-by-block (one page in per-page mode, up to
region_pages pages in region mode), and writes the result into the
target process via process_vm_writev(). Zero pages are not written at
all; the target process VMAs are MAP_ANONYMOUS, so unwritten pages
remain on the kernel zero page and do not consume physical memory.

The decompression buffer is mmap(MAP_ANONYMOUS) with MADV_HUGEPAGE to
enable the fast GUP path in process_vm_writev() and to reduce TLB
misses. MADV_DONTNEED re-zeros the buffer between requests.
posix_fadvise(FADV_DONTNEED) is called after each batch read to
release page cache for already-read compressed data.

Per-block compressed sizes (and per-block page counts in region mode)
are validated against the corresponding bounds before use to prevent
out-of-bounds reads from corrupted images. Negative
n_pages/nr_iovs/n_blocks values are rejected. The process_vm_writev()
iovec count is capped at IOV_MAX per call.

Pipe I/O uses pipe_write_full()/pipe_read_full() in the PIE restorer
and read_full() in the daemon to handle short reads and writes on pipe
buffer boundaries.

The daemon PID is stored in decompress_daemon_pid in task_restore_args
instead of appending to the helpers array, which would corrupt the
array built by collect_helper_pids(). The restorer waits for the
daemon explicitly after closing the pipes.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
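To make the request framing concrete, here is a Python sketch that packs and parses a header followed by `compressed_size[]` and, in region mode, `block_pages[]`. It is only a model of the layout described above: the format string assumes an x86_64 Linux ABI (4-byte `pid_t` and `int`, 8-byte `off_t`, 4 bytes of padding before `offs`), `compressed_size` is assumed to be uint32 on the wire as in the protobuf definition, the trailing `iovs[]` are omitted, and `PipeHdr`, `pack_request` and `unpack_request` are hypothetical names.

```python
import struct
from typing import NamedTuple

# Assumed layout: int32 pid, 4 pad bytes, int64 offs, uint64 total,
# three int32 counters, uint32 region_pages (little-endian).
PIPE_HDR = struct.Struct("<i4xqQiiiI")


class PipeHdr(NamedTuple):
    remote_pid: int
    offs: int
    total_compressed_size: int
    n_pages: int
    nr_iovs: int
    n_blocks: int
    region_pages: int


def pack_request(hdr: PipeHdr, sizes, block_pages=()) -> bytes:
    """Serialize header + compressed_size[] (+ block_pages[] in region mode)."""
    if len(sizes) != hdr.n_blocks:
        raise ValueError("n_blocks does not match compressed_size[]")
    out = PIPE_HDR.pack(*hdr) + struct.pack(f"<{hdr.n_blocks}I", *sizes)
    if hdr.region_pages:
        out += struct.pack(f"<{hdr.n_blocks}H", *block_pages)
    return out


def unpack_request(buf: bytes):
    """Parse a request, rejecting negative counters as the daemon does."""
    hdr = PipeHdr(*PIPE_HDR.unpack_from(buf, 0))
    if min(hdr.n_pages, hdr.nr_iovs, hdr.n_blocks) < 0:
        raise ValueError("negative count in header")
    off = PIPE_HDR.size
    sizes = list(struct.unpack_from(f"<{hdr.n_blocks}I", buf, off))
    off += 4 * hdr.n_blocks
    pages = []
    if hdr.region_pages:
        pages = list(struct.unpack_from(f"<{hdr.n_blocks}H", buf, off))
    return hdr, sizes, pages
```

Under the assumed ABI the header occupies 40 bytes. A real reader would additionally check each `compressed_size` and `block_pages` value against its bound before using it, as the commit message describes.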
Add maybe_read_page_local_compressed() and
maybe_read_page_img_streamer_compressed() for restoring compressed
pages from local images and streaming pipes respectively.

Both readers fall back to the uncompressed path when a pagemap entry
has no compressed_size array, which happens with shared memory
pagemaps or entries from uncompressed parent images. They also
dispatch on pe->region_pages: per-page mode uses
read_compressed_pages(), which decompresses page-by-page directly into
the destination buffer; region mode uses
read_compressed_pages_region(), which decompresses an entire block (up
to region_pages pages) into a heap scratch buffer and copies the
requested page slice into the destination iovec, supporting
partial-region reads via an in-block cursor (region_block_offset).

skip_pagemap_pages() advances pi_off by summing per-block compressed
sizes; in region mode it walks block-by-block and keeps
region_block_offset consistent so partial-region skips remain correct.

Per-block compressed sizes are validated against
PAGE_COMPRESSED_SIZE_BOUND or REGION_COMPRESSED_SIZE_BOUND(n_pages) as
appropriate. Zero blocks (compressed_size=0) are restored with memset.
The pread() calls loop to handle short reads.

The PR_ASYNC flag is supported. Compressed reads are enqueued via
pagemap_enqueue_iovec(); coalescing requires matching region_pages
between piovs. process_async_reads() reads all compressed data in one
pread() call and decompresses block-by-block into the destination
iovecs, with a direct-into-iovec fast path in region mode when a block
fits inside a single destination slot.

posix_fadvise(FADV_SEQUENTIAL) is applied to the pages image fd to
hint the kernel for aggressive readahead.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
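The skip logic can be modelled with a cursor made of a block index and an in-block page offset. The Python sketch below is an illustration under stated assumptions, not the implementation of skip_pagemap_pages(): the function name `skip_pages` is hypothetical, and it assumes the image offset advances by a block's compressed size only once that block has been fully consumed. Per-page mode is the special case where every block holds one page.

```python
def skip_pages(compressed_size, block_pages, block_idx, in_block_off, n_pages):
    """Advance the cursor by n_pages.

    Returns (block_idx, in_block_off, image_bytes_skipped).
    """
    skipped = 0
    while n_pages > 0:
        if block_idx >= len(compressed_size):
            raise ValueError("skip runs past the end of the pagemap entry")
        remaining = block_pages[block_idx] - in_block_off
        if n_pages < remaining:
            # Partial-region skip: stay inside the current block.
            in_block_off += n_pages
            n_pages = 0
        else:
            # Block fully consumed: its payload is now behind the cursor.
            n_pages -= remaining
            skipped += compressed_size[block_idx]
            block_idx += 1
            in_block_off = 0
    return block_idx, in_block_off, skipped
```

For example, with two region blocks of 64 and 32 pages, skipping 10 pages leaves the cursor at page 10 of block 0 without moving the image offset, while skipping a further 54 pages lands at the start of block 1 and advances the offset by the first block's compressed size.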
process_async_reads() allocates a single buffer for all compressed
data in a piov batch. When pages coalesce into one giant piov (common
with large GPU checkpoints), the buffer can exceed host memory.

For example, checkpointing LLaMA 3.1-8B running on A100-SXM4-80GB has
77 GB of memory and produces ~72 GiB of compressed data. Thus, without
this patch it would require 72 GiB for the decompression buffer and
77 GiB of premapped pages: 149 GiB total. This can exceed host memory
and result in OOM during restore.

Cap compressed piov batches at 1 GiB of compressed data during
coalescing in pagemap_enqueue_iovec(). Larger checkpoints split into
multiple batches, each allocating a bounded decompression buffer.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
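The effect of the cap can be sketched as a greedy split of compressed entry sizes into batches. This Python model is an assumption-laden illustration rather than the logic in pagemap_enqueue_iovec(): `split_batches` is a hypothetical name, and an entry that is larger than the cap on its own is assumed to form a batch by itself.

```python
GIB = 1 << 30


def split_batches(entry_sizes, cap=GIB):
    """Group compressed sizes so each batch stays within cap where possible."""
    batches, current, current_bytes = [], [], 0
    for size in entry_sizes:
        # Start a new batch when adding this entry would exceed the cap.
        if current and current_bytes + size > cap:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        batches.append(current)
    return batches


def peak_buffer_bytes(entry_sizes, cap=GIB):
    """Largest compressed-data buffer needed for any single batch."""
    return max((sum(b) for b in split_batches(entry_sizes, cap)), default=0)
```

Applied to the example in the commit message, the buffer for compressed data shrinks from about 72 GiB to at most roughly 1 GiB per batch, so under this model the peak would be on the order of 1 + 77 = 78 GiB instead of 72 + 77 = 149 GiB.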
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Wire up the memory page compression options through the zdtm test
framework for both CLI and RPC modes:
  -c, --compress
  --compress-region SIZE      (K/M/G suffix accepted)
  --compress-acceleration N
The page-count validation auto-detects compression from the test
descriptor opts, so the flags work whether they come from the CLI or
from a .desc file.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Add zdtm tests that verify memory page content after checkpoint/restore
with compression, in both per-page and region modes:

compress_pages00 / compress_pages_region00: single process with
zero-filled pages, compressible pattern pages, and incompressible
random pages. Exercises all three compression outcomes (zero-skip,
LZ4 compressed, raw fallback).

compress_pages01 / compress_pages_region01: parent/child process tree
with copy-on-write pages. Parent fills 64 pages, child modifies 16 of
them. After restore, both parent and child verify their respective
views byte-by-byte.

compress_pages02 / compress_pages_region02: eight different mapping
types in a parent/child tree -- MAP_PRIVATE anonymous (data and
zeros), MAP_SHARED anonymous, private and shared file-backed, memfd
shared, read-only (PROT_READ after mprotect), and PROT_NONE guard page
adjacent to a data page.

The compress_pages_region* siblings share C source with the per-page
tests (via symlinks) and differ only in their .desc opts string.

All tests use the compress feature check to auto-skip when CRIU is
built without LZ4. The .desc files set --compress (-c) or
--compress-region=256K so compression is always active and the tests
run with --pre, --page-server, --lazy-pages, --stream, etc.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Add compression test coverage to the CI script:
- iterative checkpointing with compression
- iterative + dedup, iterative + page-server
- compress_pages tests in basic, iterative, page-server, dedup, and
  lazy-pages modes
- streaming tests with compress_pages
- mixed-compression parent chain test

Add test/others/compress-mixed/ which tests mixed-compression parent
chains: two uncompressed pre-dumps followed by a compressed final
dump, then restore. This exercises the per-entry fallback in the
compressed reader when parent pagemap entries have no compressed_size
array.

Add shellcheck coverage for test/others/compress-mixed/.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Add a tool that measures the storage overhead and checkpoint/restore
latency impact of CRIU's page compression across compression modes
and memory patterns.
The compression modes are:
none - pages written raw, no compression (the baseline)
per-page - each 4 KiB page is its own LZ4 block
region - N consecutive pages share one LZ4 block
Workload patterns:
zero - highly compressible
mixed - 50% zero / 25% repeating / 25% random
random - incompressible
text - JSON-shaped
elf - concatenated system binaries
The tool shows compression ratio, dump and restore latency
(median with interquartile range), throughput, and CRIU
stats counters. It also validates memory integrity via
SHA-256 across each restore.
Example:
sudo python3 contrib/compression-benchmark/main.py \
-p mixed text elf --modes none per-page region \
--region-sizes 65536 262144 1048576 --json out.json
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
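The reported statistics (median with interquartile range) and the SHA-256 integrity check can be sketched with the Python standard library. This is not the benchmark's code: `summarize` and `memory_digest` are hypothetical helpers, and the choice of the "inclusive" quantile method is an assumption.

```python
import hashlib
import statistics


def summarize(samples):
    """Median and interquartile range of latency samples (needs >= 2 samples)."""
    q1, _, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    return {"median": statistics.median(samples), "iqr": q3 - q1}


def memory_digest(chunks):
    """SHA-256 over a sequence of memory chunks, fed incrementally."""
    h = hashlib.sha256()
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest()


def restore_is_intact(before_chunks, after_chunks):
    """Compare the digest taken before the dump with the one after restore."""
    return memory_digest(before_chunks) == memory_digest(after_chunks)
```

Reporting the interquartile range alongside the median gives a spread measure that is less sensitive to a single slow run than min/max would be.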
Add a tool that measures the storage overhead and checkpoint/restore
latency impact of CRIU's page compression on a real GPU workload: an
SGLang inference server running inside a Podman container.
Each trial starts an SGLang container, validates inference with a chat
request, checkpoints the container, removes it, restores it, and
validates inference again. CRIU memory-page compression is varied
through /etc/criu/runc.conf while Podman's own archive compression is
kept at "none" by default, so the reported archive size reflects CRIU
image size rather than tar-level gzip/zstd compression.
The compression modes are:
none - pages written raw, no compression (the baseline)
per-page - each 4 KiB page is its own LZ4 block
region - N consecutive pages share one LZ4 block
The tool reports archive size, compression ratio, and median
checkpoint, restore, and post-restore request latency across modes,
running a warmup pass before the measured iterations.
Example:
sudo HF_TOKEN=... python3 \
contrib/compression-benchmark/podman-sglang.py \
--model Qwen/Qwen3-0.6B -n 3 \
--modes none per-page region --json out.json
Assisted-by: Claude Code:claude-opus-4-8[1m]
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Add offline tools to convert checkpoint images between compressed and
uncompressed formats:
  crit compress <dir>    -- compress memory pages with LZ4
  crit decompress <dir>  -- decompress memory pages
By default, original files are backed up as .bak. Use --in-place to
skip backups. The --acceleration flag controls LZ4 speed/ratio
trade-off.

Requires the Python lz4 package (optional dependency, added to all
package manager dependency lists). When lz4 is not installed, other
crit commands work normally and the compress/decompress commands print
install instructions.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Add five tests covering compress and decompress round-trips using
compress_pages02 which exercises all eight mapping types (anonymous,
zeros, shared, file-backed, memfd, read-only, guard pages).
- compressed dump, decompress with crit, restore and verify
- uncompressed dump, compress with crit, restore and verify
- compress already compressed, decompress already decompressed
- compress, decompress, compress, verify pages are identical
- decompress, compress, decompress, verify pages are identical

Each restore runs the test process which verifies all memory regions
byte-by-byte. The round-trip tests also compare md5 checksums of the
raw pages data across cycles. When lz4 or CRIU compression support is
not available, the tests are skipped gracefully.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Test compress_data() / decompress_data() with zero-filled,
repeating pattern, pseudo-random, and single-byte pages across
three LZ4 acceleration levels.
Test compress_region() / decompress_region() with the same
patterns at region sizes {16, 64, 256} pages and acceleration
levels {1, 4, 32}, including an "all zeros except one non-zero
page" case to exercise the zero pre-pass fast path and per-page
zero detection inside the decompression result.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This pull request extends CRIU with support for LZ4 compression applied during `dump`/`pre-dump` and reversed on `restore`. The goal is to reduce the size of data stored in `pages-*.img` files by compressing memory pages before they are written to the image and decompressing them after they are read back. This minimizes the amount of data written to and read from storage or transferred over the network (with page-server and criu-image-streamer).

This functionality supports two compression modes:

- Per-page (`-c`, `--compress`): each 4 KiB page is compressed as its own LZ4 block.
- Region (`--compress-region SIZE`): consecutive pages are grouped into regions of SIZE bytes, and each region is compressed as a single LZ4 block (default 256K, max 4M).

In addition, the `--compress-acceleration N` option exposes an LZ4 parameter providing a trade-off between speed and compression ratio. It controls how often LZ4 looks for repeated data while compressing. A value of 1 checks the input most thoroughly and usually produces the best compression ratio. Higher values cause LZ4 to skip more locations between searches and reduce the amount of work it does. This makes compression faster but increases the chance of missing repeated patterns. Decompression speed is not affected by this parameter.

Each compressed block stores its size in `pagemap_entry.compressed_size[]`, with three cases:

- `0`: all-zero block (no payload written)
- `block_bytes`: stored raw (incompressible, or the savings fall short of the 7/8 store-raw threshold, so compressing isn't worth the restore cost)
- any other value: LZ4-compressed block of that size

In addition, `total_compressed_size` lets the reader size a single `pread()`, and `region_pages` distinguishes the two modes. The compression mode is saved in the inventory image (on both `dump` and `pre-dump`) for automatic detection on restore.

The following benchmark (`contrib/compression-benchmark/`) results show the saved pages image size when running with a 1 GB workload (median of 5 runs) on Intel Core Ultra 9 275HX:

Incompressible pages fall back to raw storage (7/8 threshold), so the worst case is break-even rather than a regression. Region mode improves the ratio on realistic data by sharing one LZ4 block across consecutive pages. The benchmark verifies memory integrity with SHA-256 after every restore.

The benchmark was run on an ext4 filesystem backed by a 3.6 TB Corsair NVMe SSD (ROTA=0). For this mixed workload, compression consistently reduces checkpoint size and improves both dump and restore time compared to uncompressed checkpoints, because less data has to be written and read back. However, these results are hardware-dependent. While compression reliably reduces the checkpoint size, the dump/restore time depends on whether the system is more I/O-bound or CPU-bound.
The following are evaluation results for SGLang:
The following presentation and research paper provide the motivation and a detailed evaluation of this work in the context of low-latency elastic inference serving:
This pull request also extends CRIT with two offline image-conversion commands that mirror the built-in CRIU compression:

- `crit compress <dir>`: Compress the memory pages of a checkpoint directory in place.
- `crit decompress <dir>`: Decompress the memory pages of a checkpoint directory back to their uncompressed form.

Both commands default to writing `.bak` backups of the files they rewrite, and support an `--in-place` option that skips creating these backups.