Change block index verification #2056
Prior to this commit, the first and last blocks were loaded.
Now, before looking for the index, the start of the first block is
found. The first block typically starts immediately after the tree.
This is convenient because the code has just finished reading the
tree, so checking for the first block should usually require reading
only the 4 bytes of the block magic. The exception is when the tree
is padded, in which case the entire padding needs to be read.
Now that the code knows where the first block starts, it uses
that offset to check the block index (if one is found). If the
first index entry doesn't match, the code falls back to reading
serially. If it matches, the code uses the block index.
This access pattern should be more efficient (and a better match
for cloud access) than the old code, which did something like the
following.
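The check described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `find_first_block` and `choose_offsets` are hypothetical names, the padding is assumed to consist only of NUL bytes, and the "tree" is a stand-in byte string.

```python
import io

BLOCK_MAGIC = b"\xd3BLK"  # 4-byte ASDF block magic token


def find_first_block(fd, tree_end):
    """Return the offset of the first block magic at or after tree_end.

    Hypothetical helper: only NUL padding is skipped (an assumption of
    this sketch). Returns None if no block magic follows the padding.
    """
    offset = tree_end
    while True:
        fd.seek(offset)
        chunk = fd.read(len(BLOCK_MAGIC))
        if chunk == BLOCK_MAGIC:
            return offset
        if not chunk or chunk[0:1] != b"\0":
            return None
        offset += 1  # step over one padding byte and try again


def choose_offsets(first_block_offset, index_offsets):
    """Return the index offsets if the first entry matches, else None.

    None stands for "fall back to reading the blocks serially".
    """
    if first_block_offset is None or not index_offsets:
        return None
    if index_offsets[0] != first_block_offset:
        return None
    return list(index_offsets)


# Stand-in file: a fake tree, 3 bytes of padding, then one block magic.
tree = b"%YAML 1.1\n--- {}\n...\n"
data = tree + b"\0" * 3 + BLOCK_MAGIC + b"payload"
first = find_first_block(io.BytesIO(data), len(tree))

trusted = choose_offsets(first, [first, 100])       # first entry matches
rejected = choose_offsets(first, [first + 1, 100])  # first entry is wrong
```

With no padding, the loop reads exactly 4 bytes, which matches the common case described above.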
=== old routine ===
Assume we have a file with:
[ ---- tree --- _padding_ blk0 blk1 ... blkN index ]
For both the old and new routines, reading starts with the tree:
                fp index
                |
[ ---- tree --- ********************************** ]
In the old routine the index was found next, which required a seek
to the end and then reading backwards to find the index:
                |----------------------- seek --->|
[ ---- tree --- **************************** index ]
then a seek back toward the start to read the first block (after
the padding):
                          |<-seek-----------------|
[ ---- tree --- _padding_ blk0 ************* index ]
then a seek to the last block to read it:
                               ---seek->|
[ ---- tree --- _padding_ blk0 ******** blkN index ]
=== new routine ===
For the new routine the process is simpler and starts the same way,
by reading the tree:
                fp index
                |
[ ---- tree --- ********************************** ]
Then the first block is found (just the magic):
                          |
[ ---- tree --- _padding_ ************************ ]
Then the file is searched for a block index (which is read if found):
                          |-------------- seek --->|
[ ---- tree --- _padding_ *******************index ]
and that's it.
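The final step, searching for the block index from the end of the file, might look something like the sketch below. It assumes the index begins with the `#ASDF BLOCK INDEX` header and lists one offset per `- ` line. `read_block_index` and `tail_size` are hypothetical, and the line-based parsing is a simplification standing in for a real YAML parser.

```python
import io
import os

INDEX_HEADER = b"#ASDF BLOCK INDEX"


def read_block_index(fd, tail_size=4096):
    """Return the list of offsets from a trailing block index, or None.

    Hypothetical helper: a single seek positions the file near the end,
    then only the last tail_size bytes are read. An index larger than
    tail_size would be missed by this sketch.
    """
    fd.seek(0, os.SEEK_END)
    file_size = fd.tell()
    fd.seek(max(0, file_size - tail_size))
    tail = fd.read()
    pos = tail.rfind(INDEX_HEADER)
    if pos < 0:
        return None
    offsets = []
    for line in tail[pos:].splitlines():
        if line.startswith(b"- "):
            offsets.append(int(line[2:].decode("ascii")))
    return offsets


# Stand-in file contents; the offsets listed are illustrative only.
fake_file = (
    b"tree...\n"
    + b"\xd3BLK" + b"x" * 10
    + INDEX_HEADER + b"\n%YAML 1.1\n---\n- 8\n- 30\n...\n"
)
found = read_block_index(io.BytesIO(fake_file))
missing = read_block_index(io.BytesIO(b"no index here"))
```

The offsets returned here would then be compared against the first block offset found earlier before being trusted.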
One possible (and very unlikely) downside is that if a file somehow has
- a valid block index
- a valid first block offset
- an invalid last block offset
we would not catch the error (but we would have before). It seems
impossible to generate such a file without manually modifying block
headers. If that's happening, there is no reason blocks other than the
first and last could not also be incorrect, so the old routine was not
absolute either.
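To make the trade-off concrete, here is a small sketch comparing three levels of checking against hand-built data. The "blocks" are just a magic token followed by a few payload bytes, not real block headers, and all function names are hypothetical.

```python
BLOCK_MAGIC = b"\xd3BLK"


def magic_at(data, offset):
    """True if the block magic appears at the given offset."""
    return data[offset:offset + len(BLOCK_MAGIC)] == BLOCK_MAGIC


def check_first_only(data, offsets):
    """New-style check: only the first index entry is verified."""
    return bool(offsets) and magic_at(data, offsets[0])


def check_first_and_last(data, offsets):
    """Old-style check: the first and last index entries are verified."""
    return (
        bool(offsets)
        and magic_at(data, offsets[0])
        and magic_at(data, offsets[-1])
    )


def check_all(data, offsets):
    """Exhaustive check, which neither routine performs."""
    return bool(offsets) and all(magic_at(data, o) for o in offsets)


# Three fake blocks at offsets 0, 8 and 16.
data = BLOCK_MAGIC + b"aaaa" + BLOCK_MAGIC + b"bbbb" + BLOCK_MAGIC + b"cccc"

bad_last = [0, 8, 17]    # only the last entry is wrong
bad_middle = [0, 9, 16]  # only a middle entry is wrong

results = {
    "bad_last/first_only": check_first_only(data, bad_last),
    "bad_last/first_and_last": check_first_and_last(data, bad_last),
    "bad_middle/first_and_last": check_first_and_last(data, bad_middle),
    "bad_middle/all": check_all(data, bad_middle),
}
```

A wrong last offset slips past the first-only check but not the first-and-last check, while a wrong middle offset slips past both, which is the sense in which the old routine was not absolute either.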
sydduckworth
left a comment
There's a chunk of code not being hit by unit tests, which is the logic described in #2062 that emits a warning if there's an empty block index but no blocks.
Should probably add tests to capture that behavior.
Otherwise LGTM
Thanks! Good eye finding the uncovered code. I pushed a change in 2e43751 that:
The updated test was causing issues with pyrefly. Part of this seemed to be an incorrect type for the return value of |
Ah yeah I think I set the typing of that function because when constructing the block index the values can be |
Tasks
- run prek on your machine
- run pytest on your machine
- add a news fragment (or label with no-changelog-entry-needed) in changes/: echo "changed something" > changes/<PR#>.<changetype>.rst (see below for change types)
- update the relevant docs/ page
news fragment change types...
- changes/<PR#>.feature.rst: new feature
- changes/<PR#>.bugfix.rst: bug fix
- changes/<PR#>.doc.rst: documentation change
- changes/<PR#>.removal.rst: deprecation or removal of public API
- changes/<PR#>.general.rst: infrastructure or miscellaneous change