Change block index verification #2056
Draft
braingram wants to merge 5 commits into
c75ea8e to
6e9c45d
Prior to this commit, the first and last blocks were loaded to verify the block index.
Now the start of the first block is found before the block index is searched for.
The first block typically starts immediately after the tree.
This is convenient because the code has just finished reading the tree, so
checking for the first block should usually require reading only the 4 bytes
of the block magic. The exception is when the tree is padded:
in that case the entire padding has to be read first.
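To make the first-block search concrete, here is a minimal sketch of the idea, not the code from this commit. `find_first_block_offset` and `chunk_size` are hypothetical names, and the sketch assumes the padding after the tree consists only of null bytes and that the block magic is the 4-byte token `\xd3BLK`.

```python
import io

BLOCK_MAGIC = b"\xd3BLK"  # assumed 4-byte ASDF block magic token


def find_first_block_offset(fd, chunk_size=4096):
    """Return the offset of the first block at or after fd's position.

    Hypothetical helper: assumes fd is positioned just past the tree and
    that any padding is made of null bytes. Returns None if no block follows.
    """
    offset = fd.tell()
    # Common case: the block starts right after the tree, so 4 bytes suffice.
    buf = fd.read(len(BLOCK_MAGIC))
    while buf:
        stripped = buf.lstrip(b"\0")
        offset += len(buf) - len(stripped)  # skip over padding bytes
        if stripped:
            # Make sure enough bytes are available to compare with the magic.
            if len(stripped) < len(BLOCK_MAGIC):
                stripped += fd.read(len(BLOCK_MAGIC) - len(stripped))
            return offset if stripped.startswith(BLOCK_MAGIC) else None
        # Padded tree: keep reading forward (no seek) until the padding ends.
        buf = fd.read(chunk_size)
    return None
```

In the unpadded case only one 4-byte read happens; with padding, the cost grows with the padding size but the reads stay strictly sequential.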
Now that the code knows where the first block starts, it uses
that offset to check the block index (if one is found). If the
first index entry doesn't match, the code falls back to reading the blocks serially.
If it matches, the code uses the block index.
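The decision described above can be sketched roughly as follows. `offsets_to_use` is a hypothetical helper, not a function from asdf, and the block index is assumed to be already parsed into a list of integer offsets.

```python
def offsets_to_use(first_block_offset, index_offsets):
    """Return the index offsets if they can be trusted, else None.

    Hypothetical helper: None signals that the caller should fall back
    to reading the blocks serially.
    """
    if not index_offsets:
        # No block index was found (or it was empty).
        return None
    if index_offsets[0] != first_block_offset:
        # The index disagrees with where the first block actually starts.
        return None
    return index_offsets
```

For example, `offsets_to_use(100, [100, 200, 300])` returns the list unchanged, while `offsets_to_use(100, [96, 200])` returns `None`. Only the first entry is compared, so a wrong later offset passes this check unnoticed.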
Checking only the first block gives an access pattern that should be more
efficient (and a better match for cloud access) than the old one, where the
code did something like the following.
=== old routine ===
Assume we have a file with:
[ ---- tree --- _padding_ blk0 blk1 ... blkN index ]
For both the old and new routines, reading starts with the tree:
                fp index
                |
[ ---- tree --- ********************************** ]
In the old routine the index was found next, which required a seek to the end
of the file and then reading backwards to find the index:
                |----------------------- seek --->|
[ ---- tree --- **************************** index ]
then a seek back toward the start to read the first block (after the padding):
                          |<-seek-----------------|
[ ---- tree --- _padding_ blk0 ************* index ]
then a seek to the last block to read it:
                               ---seek->|
[ ---- tree --- _padding_ blk0 ******** blkN index ]
=== new routine ===
For the new routine the process is simpler and starts the same way, with
reading the tree:
                fp index
                |
[ ---- tree --- ********************************** ]
Then the first block is found (just the magic):
                          |
[ ---- tree --- _padding_ ************************ ]
Then the file is searched for a block index (which is then read):
                          |-------------- seek --->|
[ ---- tree --- _padding_ *******************index ]
and that's it.
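The difference between the two routines can be illustrated by counting seeks on an in-memory file. This is only a toy model under stated assumptions: `SeekCounter`, `build_file`, `old_routine` and `new_routine` are hypothetical names, the block and index contents are placeholders, and the "seek to the end, then read backwards" step is simplified to a single seek to a known index offset.

```python
import io

TREE = b"t" * 16                       # placeholder tree bytes
PADDING = b"\0" * 9                    # placeholder tree padding
BLOCKS = [b"\xd3BLK" + bytes([c]) * 4 for c in b"01N"]  # fake blocks
INDEX = b"#ASDF BLOCK INDEX\n"        # placeholder index payload


class SeekCounter(io.BytesIO):
    """BytesIO that counts how many times seek() is called."""

    def __init__(self, data):
        super().__init__(data)
        self.seeks = 0

    def seek(self, pos, whence=io.SEEK_SET):
        self.seeks += 1
        return super().seek(pos, whence)


def build_file():
    """Return (file bytes, block offsets, index offset) for the toy layout."""
    data = TREE + PADDING
    offsets = []
    for blk in BLOCKS:
        offsets.append(len(data))
        data += blk
    return data + INDEX, offsets, len(data)


def old_routine(fd, offsets, index_offset):
    fd.read(len(TREE))                 # read the tree
    fd.seek(index_offset)              # jump to the index near the end
    fd.read()
    fd.seek(offsets[0])                # jump back to the first block
    fd.read(4)
    fd.seek(offsets[-1])               # jump forward to the last block
    fd.read(4)
    return fd.seeks


def new_routine(fd, offsets, index_offset):
    fd.read(len(TREE))                 # read the tree
    fd.read(offsets[0] - len(TREE) + 4)  # padding plus block magic, no seek
    fd.seek(index_offset)              # single jump to the index
    fd.read()
    return fd.seeks
```

In this model the old routine performs three seeks and the new routine one, which is the property that matters when each seek can translate into a separate remote request.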
One possible (and very unlikely) downside is that if a file somehow has
- a valid block index
- a valid first block offset
- an invalid last block offset
we would no longer catch the error (the old routine would). It seems impossible
to generate such a file without manually modifying block headers. If
that is happening, there is no reason blocks other than the first and last
could not also be incorrect, so the old routine was not an absolute check either.
6e9c45d to
7ffbed1
braingram commented Jun 1, 2026
return blocks

# skip magic for each block
fd.seek(starting_offset)
This seek is here only to keep sphinx-asdf from crashing for asdf directives that show blocks. Once #2057 is merged we can remove this seek.
Tasks
- run prek on your machine
- run pytest on your machine
- label the PR no-changelog-entry-needed if no changelog entry is required; otherwise:
  - news fragment(s) in changes/: echo "changed something" > changes/<PR#>.<changetype>.rst (see below for change types)
  - docs/ page

news fragment change types...
- changes/<PR#>.feature.rst: new feature
- changes/<PR#>.bugfix.rst: bug fix
- changes/<PR#>.doc.rst: documentation change
- changes/<PR#>.removal.rst: deprecation or removal of public API
- changes/<PR#>.general.rst: infrastructure or miscellaneous change