Closed
Description
current state (current master branch, borg 1.x, borg 0.x, attic)
A borg repository is primarily a key/value store (with some aux functions).
The key is the chunk id (== MAC(plaintext)); the value is the compressed, encrypted and authenticated data.
borg uses transactions and a LOG when writing to the repo:
- start of transaction (usually triggered by PUT/DEL)
- writes more objects by appending PUT entries to the log
- deletes objects by appending DEL entries to the log
- commits (appends a COMMIT entry to the log)
- end of transaction (S: saves repo index and hints, C: saves chunks index and files cache)
LOG means that new stuff is always appended at the end of the last/current segment file. In general, old segment files are never modified in place.
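The transaction steps above can be illustrated with a toy model. This is only a sketch: the log is a plain Python list of tuples, and the `replay` function and its entry layout are made up for illustration, they are not borg's on-disk format or code. It shows why an append-only log with COMMIT markers can roll back after a crash: anything after the last COMMIT is simply ignored on replay.

```python
# Toy model only: entries are tuples in a Python list, not borg's segment format.
PUT, DEL, COMMIT = "PUT", "DEL", "COMMIT"


def replay(log):
    """Rebuild the key/value state, ignoring entries after the last COMMIT."""
    committed = {}  # state as of the last COMMIT seen so far
    pending = {}    # working state of the currently open transaction
    for entry in log:
        tag = entry[0]
        if tag == PUT:
            _, key, value = entry
            pending[key] = value
        elif tag == DEL:
            _, key = entry
            pending.pop(key, None)
        elif tag == COMMIT:
            # The transaction becomes durable only at this point.
            committed = dict(pending)
    return committed


log = [
    (PUT, "id1", b"aaa"),
    (PUT, "id2", b"bbb"),
    (COMMIT,),
    (DEL, "id1"),
    (PUT, "id3", b"ccc"),
    # Simulated crash: no COMMIT follows, so the DEL and the last PUT do not count.
]

state_after_crash = replay(log)
state_if_committed = replay(log + [(COMMIT,)])
```

Here `state_after_crash` still contains `id1` and `id2`, while `state_if_committed` contains `id2` and `id3`.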
borg compact defrags non-compact segment files:
- a segment file contains PUTs, DELs, COMMITs
- if a PUT(id) is later deleted by a DEL(id), it creates a logical hole in a segment file (that object is not used any more), making it non-compact
- compaction / defragging works by reading all still-needed objects from an old segment file and appending them to a new segment file. after that is finished, the old segment file is deleted (and that frees disk space because the new segment file is smaller).
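A minimal sketch of the compaction idea described above, again using a toy in-memory model rather than borg's real code. The names `compact_segment`, `segments` and `index` are hypothetical. One simplifying assumption: DEL entries are dropped during compaction, which is only safe here because the toy example has no older segment that still holds a PUT for the same id.

```python
def compact_segment(segments, index, old_no, new_no):
    """Copy still-needed PUTs from segment old_no to segment new_no, then drop old_no.

    segments: dict segment_no -> list of ("PUT", id, data) or ("DEL", id) entries
    index:    dict id -> (segment_no, offset) for every live object
    """
    new_segment = segments.setdefault(new_no, [])
    for offset, entry in enumerate(segments[old_no]):
        if entry[0] != "PUT":
            continue  # toy simplification: DEL entries are not carried over
        obj_id = entry[1]
        # Only copy the entry if the index still points at exactly this copy.
        if index.get(obj_id) == (old_no, offset):
            index[obj_id] = (new_no, len(new_segment))
            new_segment.append(entry)
    # Deleting the old segment is what frees the space taken by the holes.
    del segments[old_no]


segments = {0: [("PUT", "a", b"1"), ("PUT", "b", b"2"), ("DEL", "a")]}
index = {"b": (0, 1)}  # "a" was deleted, so only "b" is live
compact_segment(segments, index, old_no=0, new_no=1)
```

After the call, segment 0 is gone and segment 1 holds only the PUT for `b`. Note that the new segment has to be fully written before the old one is removed, which is why compaction needs some free space to work.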
advantages of this approach
- transactions and an append-only log are very safe approaches (even if stuff crashes, it usually can roll back to the previous state and be fine again)
- segment files are medium size files: not too large, not too small, not too many
- works well even with not very scalable filesystems
- has little overhead due to fs block / cluster size
- can be copied or deleted rather quickly (not many fs objects)
disadvantages of this approach
- borg compact can cause lots of I/O when shuffling objects from old non-compact segments to new compact segments
- borg compact needs some space on the fs to be able to work. bad if your fs is 100% full...
- compaction code is rather complex, same for transaction management
- to quickly access objects, the repository needs an index mapping id -> (segment, offset, flags)
- borg currently loads the repo index (hashtable) into memory. RAM usage is about 44 bytes * object_count + free space in the hashtable. if you have a lot of files and/or a lot of data volume, the repo index can need GBs of RAM.
- to implement this, some special borg code is needed with access to the repo filesystem
- hard to work like this without locking the repository against concurrent access.
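To get a feeling for the repo index RAM usage mentioned above, here is a rough back-of-the-envelope sketch. The 44 bytes per entry comes from the description; the function name `estimate_index_ram` and the `load_factor` parameter are assumptions for illustration, used to model the "free space in hashtable" as empty buckets of the same size. The actual fill ratio of borg's hashtable may differ.

```python
import math


def estimate_index_ram(object_count, entry_size=44, load_factor=0.75):
    """Estimate hashtable memory in bytes.

    entry_size:  bytes per bucket (44 as stated in the description)
    load_factor: assumed fraction of buckets in use (hypothetical value)
    """
    buckets = math.ceil(object_count / load_factor)
    return buckets * entry_size


# 100 million objects with the assumed load factor of 0.75
ram_bytes = estimate_index_ram(100_000_000)
ram_gb = ram_bytes / 1e9
```

With these assumptions, 100 million objects come out at roughly 5.9 GB, which matches the "GBs of RAM" remark for large repositories.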
rikimaru0345