@dlrobertson
Contributor

@dlrobertson dlrobertson commented Jul 13, 2021

Add documentation comments to the btree iter flag definitions.

CC: #269

Signed-off-by: Dan Robertson [email protected]

…dvance

The way btree iterators work internally has been changing, particularly
with the iter->real_pos changes, and bch2_btree_iter_next() is no longer
hyper optimized - it's just advance followed by peek, so it's more
efficient to just call advance where we're not using the return value of
bch2_btree_iter_next().
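
As a rough illustration of the point above, here is a minimal sketch of an iterator whose next() is just advance followed by peek. The `toy_iter` type and its helpers are hypothetical stand-ins over a sorted array, not the bcachefs btree iterator API.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy iterator over a sorted array; not struct btree_iter. */
struct toy_iter {
	const int *keys;
	size_t     nr;
	size_t     pos;
};

/* Step to the next position; returns false if already past the end. */
static bool toy_iter_advance(struct toy_iter *iter)
{
	if (iter->pos >= iter->nr)
		return false;
	iter->pos++;
	return true;
}

/* Return the key at the current position, or NULL if past the end. */
static const int *toy_iter_peek(const struct toy_iter *iter)
{
	return iter->pos < iter->nr ? &iter->keys[iter->pos] : NULL;
}

/* next() does no extra work of its own: it is advance followed by peek. */
static const int *toy_iter_next(struct toy_iter *iter)
{
	return toy_iter_advance(iter) ? toy_iter_peek(iter) : NULL;
}
```

A caller that ignores the return value of `toy_iter_next()` pays for a peek it never uses, so calling `toy_iter_advance()` directly is the cheaper choice in that case.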

Signed-off-by: Kent Overstreet <[email protected]>
btree node iterators need to obey the regular btree node invariants
w.r.t. iter->real_pos; once they do, bch2_btree_iter_traverse will have
less to check.

Signed-off-by: Kent Overstreet <[email protected]>
This means bch2_btree_iter_traverse_one() can be made more efficient.

Signed-off-by: Kent Overstreet <[email protected]>
Since we're no longer doing next() immediately followed by peek(), this
optimization isn't doing anything anymore.

Signed-off-by: Kent Overstreet <[email protected]>
This just gives some internal helpers some better names.

Signed-off-by: Kent Overstreet <[email protected]>
Ideally we'll be getting rid of peek_with_updates(), but the callers
will need to be checked.

Signed-off-by: Kent Overstreet <[email protected]>
peek() has to update iter->real_pos - there's no need for
bch2_btree_iter_set_pos() to update it as well.

Signed-off-by: Kent Overstreet <[email protected]>
More prep work for snapshots.

Signed-off-by: Kent Overstreet <[email protected]>
It was using the method for btree_ptr_v1, but that wasn't checking all
the fields.

Signed-off-by: Kent Overstreet <[email protected]>
It had some silly redundancies.

Signed-off-by: Kent Overstreet <[email protected]>
External (to the btree iterator code) users of bch2_btree_iter_traverse
expect that on success the iterator will be pointed at iter->pos and
have that position locked - but since we split iter->pos and
iter->real_pos, that means it has to update iter->real_pos if necessary.

Internal users don't expect it to modify iter->real_pos, so we need two
separate functions.

Signed-off-by: Kent Overstreet <[email protected]>
This adds a mode to six locks where readers use percpu counters -
avoiding writing to shared cachelines.

The algorithm is the same as the existing percpu-rwsemaphore's slowpath
algorithm: taking a read lock still requires a memory barrier to check
if we raced with another thread taking a write lock, but this means that
taking a write lock doesn't incur the cost of an RCU barrier.
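
The reader/writer handshake described above can be sketched roughly as follows. This is a simplified userspace model using C11 atomics and trylock-only operations, not the six lock implementation: `toy_pcpu_lock`, `NR_SLOTS` and the explicit `slot` argument (standing in for the current CPU) are all assumptions made for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NR_SLOTS 4	/* stand-in for the number of CPUs (assumption) */

struct toy_pcpu_lock {
	atomic_int  readers[NR_SLOTS];	/* one reader count per "CPU" */
	atomic_bool write_locked;
};

static bool toy_read_trylock(struct toy_pcpu_lock *l, unsigned slot)
{
	/* Readers only write to their own counter, not a shared cacheline. */
	atomic_fetch_add(&l->readers[slot], 1);

	/*
	 * The seq_cst operations act as the memory barrier: our increment
	 * is ordered before we check whether a writer raced with us.
	 */
	if (atomic_load(&l->write_locked)) {
		atomic_fetch_sub(&l->readers[slot], 1);	/* back out */
		return false;
	}
	return true;
}

static void toy_read_unlock(struct toy_pcpu_lock *l, unsigned slot)
{
	atomic_fetch_sub(&l->readers[slot], 1);
}

static bool toy_write_trylock(struct toy_pcpu_lock *l)
{
	bool expected = false;
	int sum = 0;

	if (!atomic_compare_exchange_strong(&l->write_locked, &expected, true))
		return false;

	/* No RCU-style grace period: just sum the per-slot reader counts. */
	for (unsigned i = 0; i < NR_SLOTS; i++)
		sum += atomic_load(&l->readers[i]);

	if (sum) {
		atomic_store(&l->write_locked, false);
		return false;
	}
	return true;
}

static void toy_write_unlock(struct toy_pcpu_lock *l)
{
	atomic_store(&l->write_locked, false);
}
```

A real lock would block or spin instead of returning false, and would have to handle a reader unlocking on a different CPU than it locked on; the sketch leaves both out.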

Signed-off-by: Kent Overstreet <[email protected]>
The default was 1/256th of the device and capped at 512MB, which is
fairly tiny these days.
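
To make the arithmetic concrete, here is a small sketch of the old rule as described. The function name is a placeholder (the message as shown here does not say which size this default applies to), and "512MB" is assumed to mean 512 MiB.

```c
#include <stdint.h>

#define TOY_OLD_CAP_BYTES (512ULL << 20)	/* 512 MiB (assumed meaning of "512MB") */

/* Old default as described: 1/256th of the device, capped at 512 MiB. */
static uint64_t toy_old_default_size(uint64_t dev_bytes)
{
	uint64_t size = dev_bytes / 256;

	return size < TOY_OLD_CAP_BYTES ? size : TOY_OLD_CAP_BYTES;
}
```

Under these assumptions a 64 GiB device gets 256 MiB, and any device of 128 GiB or more hits the 512 MiB cap.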

Signed-off-by: Kent Overstreet <[email protected]>
Bkey noops were introduced to deal with trimming inline data extents in
place in the btree: if the u64s field of a bkey was 0, that u64 was a
noop and we'd start looking for the next bkey immediately after it.
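
The noop rule described above can be sketched like this. The layout is hypothetical: the sketch assumes the low 8 bits of a key's first u64 hold its length in u64s, which is a simplification and not the real struct bkey layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: low 8 bits of a key's first u64 are its u64s field. */
static unsigned toy_key_u64s(uint64_t first_word)
{
	return (unsigned)(first_word & 0xff);
}

/* Count the real keys in buf[0..nr), stepping over noops one u64 at a time. */
static size_t toy_count_keys(const uint64_t *buf, size_t nr)
{
	size_t i = 0, keys = 0;

	while (i < nr) {
		unsigned u64s = toy_key_u64s(buf[i]);

		if (!u64s) {
			i++;		/* noop: the next key starts right after it */
			continue;
		}

		keys++;
		i += u64s;		/* skip over this key's header and payload */
	}
	return keys;
}
```

Every walk over a node has to carry the extra `!u64s` branch, which is the kind of special case that dropping noops removes.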

But extent handling has been lifted above the btree - we no longer
modify existing extents in place in the btree, and the compatibility code
for old style extent btree nodes is gone, so we can completely drop this
code.

Signed-off-by: Kent Overstreet <[email protected]>
On btree node split, we weren't ensuring the min_key of the new larger
node packs in the new format for this node. This triggers some painful
slowpaths in the bset.c aux search tree code - this patch fixes that by
calculating a new format for the new node with the new min_key.

Signed-off-by: Kent Overstreet <[email protected]>
We weren't packing the min/max keys, which was a major oversight and
completely disabled generating bkey_floats for adjacent nodes.

Signed-off-by: Kent Overstreet <[email protected]>
…d to

When we pass BTREE_INSERT_NOUNLOCK bch2_trans_commit isn't supposed to
unlock after a successful commit, but it was calling
bch2_trans_cond_resched() - oops.

Signed-off-by: Kent Overstreet <[email protected]>
Since we now make sure to always generate packed bkey formats that can
pack the min_key of a btree node, this path should actually never
happen.

Signed-off-by: Kent Overstreet <[email protected]>
The btree key cache mutex was becoming a significant bottleneck - it was
mainly used to protect the lists of dirty, clean and freed cached keys.

This patch eliminates the dirty and clean lists - instead, when we need
to scan for keys to drop from the cache we iterate over the rhashtable,
and thus we're able to remove most uses of that lock.

Signed-off-by: Kent Overstreet <[email protected]>
With snapshots, we're going to need to differentiate between comparisons
that should and shouldn't include the snapshot field. bpos_cmp is now
the comparison function that does include the snapshot field, used by
core btree code.

Upper level filesystem code generally does _not_ want to compare against
the snapshot field - that code wants keys to compare as equal even when
one of them is in an ancestor snapshot.
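
A minimal sketch of the two kinds of comparison, assuming a position made of inode, offset and snapshot fields compared in that order. The `toy_` names are illustrative; in particular, the name of the snapshot-ignoring helper is an assumption, since the message only names bpos_cmp.

```c
#include <stdint.h>

/* Simplified stand-in for a btree position. */
struct toy_bpos {
	uint64_t inode;
	uint64_t offset;
	uint32_t snapshot;
};

static int toy_cmp_u64(uint64_t l, uint64_t r)
{
	return (l > r) - (l < r);
}

/* Comparison for upper level code: the snapshot field is ignored. */
static int toy_pos_cmp_nosnap(struct toy_bpos l, struct toy_bpos r)
{
	int c = toy_cmp_u64(l.inode, r.inode);

	return c ? c : toy_cmp_u64(l.offset, r.offset);
}

/* Comparison for core btree code: the snapshot field is the lowest bits. */
static int toy_bpos_cmp(struct toy_bpos l, struct toy_bpos r)
{
	int c = toy_pos_cmp_nosnap(l, r);

	return c ? c : toy_cmp_u64(l.snapshot, r.snapshot);
}
```

Two positions that differ only in their snapshot field compare as equal with `toy_pos_cmp_nosnap()` but are strictly ordered by `toy_bpos_cmp()`.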

Signed-off-by: Kent Overstreet <[email protected]>
This patch starts treating the bpos.snapshot field like part of the key
in the btree code:

* bpos_successor() and bpos_predecessor() now include the snapshot field
* Keys in btrees that will be using snapshots (extents, inodes, dirents
  and xattrs) now always have their snapshot field set to U32_MAX

The btree iterator code gets a new flag, BTREE_ITER_ALL_SNAPSHOTS, that
determines whether we're iterating over keys in all snapshots or not -
internally, this controls whether bkey_(successor|predecessor)
increment/decrement the snapshot field, or only the higher bits of the
key.
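
The effect of the flag on the successor can be sketched as follows. This assumes a simplified position type and passes the flag as a plain bool; `toy_bpos` and `toy_successor()` are illustrative names, and overflow of the inode field is not handled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for a btree position; snapshot is the lowest field. */
struct toy_bpos {
	uint64_t inode;
	uint64_t offset;
	uint32_t snapshot;
};

/*
 * With all_snapshots set, the snapshot field is treated as the low bits of
 * the key and incremented first, carrying into offset and then inode.
 * Without it, the snapshot field is left alone and only the higher bits
 * (offset, then inode) are incremented.
 */
static struct toy_bpos toy_successor(struct toy_bpos p, bool all_snapshots)
{
	if (all_snapshots && ++p.snapshot)
		return p;	/* no wraparound, so no carry needed */

	if (!++p.offset)
		++p.inode;	/* offset wrapped: carry into inode */

	return p;
}
```

With the snapshot field pinned at U32_MAX, incrementing it wraps to 0 and carries into the offset, so iterating over all snapshots still reaches the next offset.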

We add a new member to struct btree_iter, iter->snapshot: when
BTREE_ITER_ALL_SNAPSHOTS is not set, iter->pos.snapshot should always
equal iter->snapshot, which will be 0 for btrees that don't use
snapshots, and always U32_MAX for btrees that will use snapshots
(until we enable snapshot creation).

This patch also introduces a new metadata version number, and compat
code for reading from/writing to older versions - this isn't a forced
upgrade (yet).

Signed-off-by: Kent Overstreet <[email protected]>
This patch adds two new inode fields, bi_dir and bi_dir_offset, that
point back to the inode's dirent.

Since we're only adding fields for a single backpointer, files that have
been hardlinked won't necessarily have valid backpointers: we also add a
new inode flag, BCH_INODE_BACKPTR_UNTRUSTED, that's set if an inode has
ever had multiple links to it. That's ok, because we only really need
this functionality for directories, which can never have multiple
hardlinks - when we add subvolumes, we'll need a way to enumerate and
print subvolumes, and this will let us reconstruct a path to a subvolume
root given a subvolume root inode.
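
A rough sketch of how a single backpointer and the untrusted flag might interact. The field names bi_dir and bi_dir_offset come from the message above; the struct, the flag's bit value, the `nlink` field and both helpers are hypothetical and only model the rule "set if an inode has ever had multiple links".

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_INODE_BACKPTR_UNTRUSTED (1U << 0)	/* bit value is hypothetical */

struct toy_inode {
	uint64_t bi_dir;	/* inode number of the directory holding the dirent */
	uint64_t bi_dir_offset;	/* position of the dirent within that directory */
	uint32_t bi_flags;
	uint32_t nlink;		/* simplified link count (assumption) */
};

/* Record a new link; only the first link gets a backpointer. */
static void toy_inode_link(struct toy_inode *inode,
			   uint64_t dir, uint64_t dir_offset)
{
	if (inode->nlink++) {
		/* Second or later link: the flag is sticky from here on. */
		inode->bi_flags |= TOY_INODE_BACKPTR_UNTRUSTED;
		return;
	}

	inode->bi_dir        = dir;
	inode->bi_dir_offset = dir_offset;
}

/* The backpointer can be relied on only if the inode was never hardlinked. */
static bool toy_backptr_trusted(const struct toy_inode *inode)
{
	return inode->nlink &&
	       !(inode->bi_flags & TOY_INODE_BACKPTR_UNTRUSTED);
}
```

Directories never take the second-link path, so in this model their backpointers always stay trusted.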

Signed-off-by: Kent Overstreet <[email protected]>
For snapshots, when we allocate a new inode we want to allocate an inode
number that isn't in use in any other subvolume. We won't be able to use
ITER_SLOTS for this; inode allocation needs to change to use
BTREE_ITER_ALL_SNAPSHOTS.

Signed-off-by: Kent Overstreet <[email protected]>
Since move.c isn't aware of what subvolume we're in, we can't use the
standard inode lookup code - fortunately, we're just using it for
reading IO options.

Signed-off-by: Kent Overstreet <[email protected]>
This adds a new watermark for the journal reclaim when flushing btree
key cache entries - it should try and stay ahead of where foreground
threads doing transaction commits will enter direct journal reclaim.

Signed-off-by: Kent Overstreet <[email protected]>
This is specifically to speed up bch2_inode_rm(), so that we're not
traversing iterators we're done with.

Signed-off-by: Kent Overstreet <[email protected]>
@koverstreet koverstreet force-pushed the master branch 12 times, most recently from 7140fd9 to 48c4f56 Compare October 19, 2025 21:57
@koverstreet koverstreet force-pushed the master branch 6 times, most recently from 232d520 to b552eb1 Compare October 26, 2025 13:23
@koverstreet koverstreet force-pushed the master branch 4 times, most recently from 4b1309a to 0032e04 Compare November 9, 2025 15:53
@koverstreet koverstreet force-pushed the master branch 2 times, most recently from f65f527 to 156233a Compare November 12, 2025 14:57
@koverstreet koverstreet force-pushed the master branch 2 times, most recently from a7a3aa8 to e336760 Compare November 26, 2025 15:08
@koverstreet koverstreet force-pushed the master branch 2 times, most recently from 8fc646f to 9c6848f Compare January 7, 2026 07:32
@koverstreet koverstreet force-pushed the master branch 2 times, most recently from 8bbabb8 to e147a0f Compare January 13, 2026 17:14
7 participants