borg is faster than "desync make" even though borg is single-threaded #244

@safinaskar

Description. borg2 is significantly (x1.2 to x2) faster than desync when storing a particular 10 GiB file to empty storage, even though borg2 is single-threaded and desync is multi-threaded. I tested this on two different machines, and both times I got a significant time difference. (In theory desync should be significantly faster than borg, because desync is parallel.) The same compression and chunking settings were used.

Steps to reproduce.

Install borg. Make sure you install borg2 as opposed to borg1. (I tested borg2 only; I don't know how well borg1 performs.) I used borg 2.0.0b5 for my tests. I installed borg2 from the Debian repository.

Build desync. I used desync from a005278. I installed the Debian package golang-go and ran git clone https://github.com/folbricht/desync.git; cd desync/cmd/desync; go build; cp desync /.

Make sure all data (both the input file and the repositories) is located on tmpfs.

Also make sure you always add data (using desync make or borg create) to an empty repository.

Grab this file: http://safinaskar.com/f/03.zst and decompress it. This is a zstd-compressed VM image. The image contains nothing confidential, so it can be distributed without restriction. It is a simple Debian VM. The compressed size is 490M. After decompression you will get a 10 GiB file.

Make sure that the file is not sparse after decompression. If you are unsure, copy it using cp --sparse=never.
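One possible way to check this precondition, as a rough sketch only. It assumes GNU coreutils stat, and the helper name on_tmpfs is made up for illustration:

```shell
# Hypothetical helper: succeeds if the given path lives on a tmpfs mount.
# Assumes GNU coreutils stat, where "-f -c %T" prints the filesystem type name.
on_tmpfs() {
    [ "$(stat -f -c %T "$1" 2>/dev/null)" = "tmpfs" ]
}

# Usage (paths are the ones from this report; adjust to your own layout):
#   on_tmpfs /home/user/dedup-bench/input-fs && echo "on tmpfs"
#   on_tmpfs /tmp/tmpfs || echo "NOT on tmpfs"
```

The helper also fails for a path that does not exist, so it should be run after the directories have been created.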
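If you want to check for sparseness instead of copying unconditionally, here is a minimal sketch. It assumes GNU coreutils stat, and the helper name is_sparse is hypothetical. It is only a heuristic: it compares allocated space with file length, so a filesystem with transparent compression could also trigger it.

```shell
# Hypothetical helper: succeeds if FILE occupies less space than its length.
# Assumes GNU coreutils stat: %b = allocated blocks, %B = bytes per block,
# %s = apparent size in bytes.
is_sparse() {
    set -- $(stat -c '%b %B %s' "$1" 2>/dev/null)
    [ $# -eq 3 ] || return 2          # stat failed (e.g. file is missing)
    [ $(( $1 * $2 )) -lt "$3" ]
}

# Usage:
#   is_sparse /home/user/dedup-bench/input-fs/03 && echo "sparse, re-copy it"
```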

Create the desync repo using these commands:

mkdir /home/user/dedup-bench/input-fs/desync-repo
mkdir /home/user/dedup-bench/input-fs/desync-repo/index
mkdir /home/user/dedup-bench/input-fs/desync-repo/storage.castr

(Of course, the exact path doesn't matter. I copied my actual path here, but you can use some other path.)

Put 03 into the desync repo using this command:

time -p /desync make -m 16:64:256 -s /home/user/dedup-bench/input-fs/desync-repo/storage.castr /home/user/dedup-bench/input-fs/desync-repo/index/a.caibx  /home/user/dedup-bench/input-fs/03

Create the borg repo using these commands:

export BORG_PASSPHRASE=password
borg2 rcreate --encryption=authenticated-blake2 --repo=/home/user/dedup-bench/input-fs/borg-repo

That directory (/home/user/dedup-bench/input-fs/borg-repo in my case) should not exist before this command (i.e. borg2 rcreate).

Add 03 to the borg repo using these commands:

export BORG_PASSPHRASE=password
time -p borg2 create --repo=/home/user/dedup-bench/input-fs/borg-repo --chunker-params buzhash,14,18,16,4095 --compression zstd,3 a /home/user/dedup-bench/input-fs/03

End of steps to reproduce.

In both cases we used the same settings: zstd level 3 (as far as I understand, desync has zstd level 3 hardcoded) and buzhash (again, as far as I understand, desync uses buzhash).

I used the string 16:64:256 for configuring desync. As far as I understand, this means that desync should use 16 KiB as the minimal chunk size, 64 KiB as the average and 256 KiB as the maximal. I used the string buzhash,14,18,16,4095 for configuring borg. This means using 2 ^ 14 bytes as the minimal chunk size, 2 ^ 16 bytes as the average and 2 ^ 18 bytes as the maximal, i.e. the same.

I did this test on two machines. The first machine is an AWS bare metal server. It was configured for benchmarks following this advice: https://easyperf.net/blog/2019/08/02/Perf-measurement-environment-on-Linux . It has 48 CPU cores, so I restricted my tests to 4 CPU cores by prepending taskset 0x0000000F to the commands. (You may say that taskset confused desync and desync was not able to discover the available parallelism. I don't think so, because the test results on this machine match the test results on the other machine, where taskset was not used.) (Side note: if I don't restrict execution to 4 CPU cores and use all 48 CPU cores, then desync becomes faster than borg.)

The AWS machine runs Debian bookworm.
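The correspondence between the two parameter strings can be checked with a bit of shell arithmetic. This is a sketch under the reading given above (borg's fields taken as min exponent, max exponent, mask bits, window size; desync's as min:avg:max in KiB). The function name borg_params_to_desync is made up for illustration.

```shell
# Hypothetical helper: turn "buzhash,MIN_EXP,MAX_EXP,MASK_BITS,WINDOW"
# into a "min:avg:max" string in KiB, treating 2^MASK_BITS as the average.
borg_params_to_desync() {
    old_ifs=$IFS
    IFS=,
    set -- $1            # split on commas: $1=buzhash $2=min $3=max $4=mask
    IFS=$old_ifs
    min_kib=$(( (1 << $2) / 1024 ))
    avg_kib=$(( (1 << $4) / 1024 ))
    max_kib=$(( (1 << $3) / 1024 ))
    echo "$min_kib:$avg_kib:$max_kib"
}

borg_params_to_desync buzhash,14,18,16,4095   # prints 16:64:256
```

Only the three size fields are compared here. The window size (4095) has no counterpart in the 16:64:256 string.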

Okay, so here are AWS results:

admin@ip-172-31-23-30:~$ time -p taskset 0x0000000F /desync make -m 16:64:256 -s /tmp/tmpfs/desync-repo/storage.castr /tmp/tmpfs/desync-repo/index/a.caibx /tmp/tmpfs/sparse-never-sto/03
Chunking [=========================================================================================================================================] 100.00% 20s
Storing [===========================================================================================================================================] 100.00% 9s
real 29.58
user 110.99
sys 4.42

admin@ip-172-31-23-30:/tmp/tmpfs$ BORG_PASSPHRASE=password time -p taskset 0x0000000F borg2 create --repo=/tmp/tmpfs/repo --chunker-params buzhash,14,18,16,4095 --compression zstd,3 a /tmp/tmpfs/sparse-never-sto/03 
real 24.39
user 22.65
sys 1.81

As you can see, borg time is 24.39 and desync time is 29.58, i.e. desync is x1.2 slower.

The second machine is my personal laptop, a Dell Inspiron. The tests were performed in Debian sid inside a Docker container in Debian stretch. The machine has 4 CPUs. No taskset was used. The machine was not tuned for benchmarks and was under heavy load. But I ran the tests twice, and both times the difference was huge, so I assume the heavy load was not the reason for these results. Okay, so here are the results:
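The same time -p output also shows how many cores each tool kept busy on average, computed as (user + sys) / real. A small sketch using awk for the floating-point division (the helper name cores_used is hypothetical):

```shell
# Hypothetical helper: average number of busy cores = (user + sys) / real.
# Arguments are the three numbers printed by "time -p": real, user, sys.
cores_used() {
    awk -v real="$1" -v user="$2" -v sys="$3" \
        'BEGIN { printf "%.2f\n", (user + sys) / real }'
}

cores_used 29.58 110.99 4.42   # desync on AWS: prints 3.90
cores_used 24.39 22.65 1.81    # borg2 on AWS:  prints 1.00
```

So desync does seem to use nearly all 4 allowed cores, while borg2 uses about one. In total desync spends about 4.7 times as much CPU time as borg2 (115.41 s against 24.46 s) for the same input.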

<[sid]>root@4af97530f96d:~# time -p /desync make -m 16:64:256 -s /home/user/dedup-bench/input-fs/desync-repo/storage.castr /home/user/dedup-bench/input-fs/desync-repo/index/a.caibx  /home/user/dedup-bench/input-fs/03
Chunking [=========================================================================================================================================] 100.00% 26s
Storing [==========================================================================================================================================] 100.00% 15s
real 42.22
user 136.80
sys 5.52

<[sid]>root@4af97530f96d:~# time -p /desync make -m 16:64:256 -s /home/user/dedup-bench/input-fs/desync-repo/storage.castr /home/user/dedup-bench/input-fs/desync-repo/index/a.caibx  /home/user/dedup-bench/input-fs/03
Chunking [=========================================================================================================================================] 100.00% 28s
Storing [==========================================================================================================================================] 100.00% 15s
real 43.42
user 141.85
sys 5.13

<[sid]>root@29ffe8165cb8:/home/user/dedup-bench/input-fs# BORG_PASSPHRASE=password time -p borg2 create --repo=/home/user/dedup-bench/input-fs/borg-repo --chunker-params buzhash,14,18,16,4095 --compression zstd,3 a /home/user/dedup-bench/input-fs/03
real 23.74
user 19.69
sys 2.19

<[sid]>root@29ffe8165cb8:/home/user/dedup-bench/input-fs# BORG_PASSPHRASE=password time -p borg2 create --repo=/home/user/dedup-bench/input-fs/borg-repo --chunker-params buzhash,14,18,16,4095 --compression zstd,3 a /home/user/dedup-bench/input-fs/03
real 20.82
user 19.46
sys 1.30

As you can see, desync is nearly x2 slower than borg. Repo size for desync is 525M and repo size for borg is 522M.
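For reference, the "nearly x2" figure can be reproduced by averaging the two runs of each tool and dividing the means. A sketch with awk (the helper name mean_ratio is made up for illustration):

```shell
# Hypothetical helper: ratio of mean(a1, a2) to mean(b1, b2).
mean_ratio() {
    awk -v a1="$1" -v a2="$2" -v b1="$3" -v b2="$4" \
        'BEGIN { printf "%.2f\n", ((a1 + a2) / 2) / ((b1 + b2) / 2) }'
}

# desync real times, then borg2 real times, from the laptop runs above
mean_ratio 42.22 43.42 23.74 20.82   # prints 1.92
```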

List of borg's optimizations:

  • borg uses blake2b as its strong hash for comparing chunks (in this test)
  • borg has special support for storing sparse files in its repo (but this should not matter for this test, because our file is not sparse) (this optimization is present in the fixed codepath; I don't know whether it is present in the buzhash codepath)
  • borg checks whether a given chunk is all-zero. If it is, and the hash (i.e. blake2b for this test) for a chunk of that size was already computed before, then borg reuses the already computed hash (this optimization is present in the fixed codepath; I don't know whether it is present in the buzhash codepath) (borg does this optimization even if the file is not sparse)
  • after reading a chunk, borg drops it from the OS cache using posix_fadvise with DONTNEED (I saw this optimization in the fixed codepath). But this should not matter in our case, because the file is stored in tmpfs
