Skip to content

Change block index verification#2056

Draft
braingram wants to merge 5 commits into
asdf-format:mainfrom
braingram:change_index_verification
Draft

Change block index verification#2056
braingram wants to merge 5 commits into
asdf-format:mainfrom
braingram:change_index_verification

Conversation

@braingram
Copy link
Copy Markdown
Contributor

Prior to this PR the first and last blocks were loaded.

Now, prior to the index finding the start of the first block is found
The first blocks typically starts immediately after the tree.
This is nice because the code just finished reading the tree meaning
checking for the first block should usually only take reading 4 bytes
to read the block magic. The exception is when the tree is padded.
Here the entire padding needs to be read.

Now that the code knows where the first block starts, it uses
that offset to check the block index (if one is found). If the
first index doesn't match, the code falls back to reading serially.
If it matches, the code uses the block index.

This access pattern should be more efficient (and a better match
to cloud access) where the old code did something like the following.

=== old routine ===

Assume we have a file with:
[ ---- tree --- padding blk0 blk1 ... blkN index ]

For old and new reading starts with the tree
fp index
|
[ ---- tree --- ********************************** ]

For old next the index was found which required a seek to the end
then reading backwards to find the index
|----------------------- seek --->|
[ ---- tree --- **************************** index ]

then a seek to the start to read the first block (after padding)
|<-seek-----------------|
[ ---- tree --- padding blk0 ************* index ]

then a seek to the last block and read it
---seek->|
[ ---- tree --- padding blk0 ******** blkN index ]

=== new routine ===

For the new routine the process is simplier and starts the same with
reading the tree:
fp index
|
[ ---- tree --- ********************************** ]

Then the first block is found (just the magic)
|
[ ---- tree --- padding ************************ ]

Then the file is searched for a block index (and read):
|-------------- seek --->|
[ ---- tree --- padding *******************index ]

and that's it.

One possible (and very unlikey) downside is if a file somehow has

  • a valid block index
  • a valid first block offset
  • an invalid last block offset

we would not catch the error (but we would before). It seems impossible
to generate such a file without manually modifying block headers. If
that's happening there is no reason non first/last blocks might also
be incorrect so the old routine was also not absolute.

Tasks

  • run prek on your machine
  • run pytest on your machine
  • Does this PR add new features and / or change user-facing code / API? (if not, label with no-changelog-entry-needed)
    • write news fragment(s) in changes/: echo "changed something" > changes/<PR#>.<changetype>.rst (see below for change types)
    • update relevant docstrings and / or docs/ page
    • for any new features, add unit tests
news fragment change types...
  • changes/<PR#>.feature.rst: new feature
  • changes/<PR#>.bugfix.rst: bug fix
  • changes/<PR#>.doc.rst: documentation change
  • changes/<PR#>.removal.rst: deprecation or removal of public API
  • changes/<PR#>.general.rst: infrastructure or miscellaneous change

@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented Jun 1, 2026

Merging this PR will not alter performance

✅ 122 untouched benchmarks


Comparing braingram:change_index_verification (7ffbed1) with main (71187c0)

Open in CodSpeed

@braingram braingram force-pushed the change_index_verification branch 2 times, most recently from c75ea8e to 6e9c45d Compare June 1, 2026 15:54
braingram added 5 commits June 1, 2026 16:04
Prior to this commit the first and last blocks were loaded.

Now, prior to the index finding the start of the first block is found
The first blocks typically starts immediately after the tree.
This is nice because the code just finished reading the tree meaning
checking for the first block should usually only take reading 4 bytes
to read the block magic. The exception is when the tree is padded.
Here the entire padding needs to be read.

Now that the code knows where the first block starts, it uses
that offset to check the block index (if one is found). If the
first index doesn't match, the code falls back to reading serially.
If it matches, the code uses the block index.

This access pattern should be more efficient (and a better match
to cloud access) where the old code did something like the following.

=== old routine ===

Assume we have a file with:
[ ---- tree --- _padding_ blk0 blk1 ... blkN index ]

For old and new reading starts with the tree
             fp index
                |
[ ---- tree --- ********************************** ]

For old next the index was found which required a seek to the end
then reading backwards to find the index
                |----------------------- seek --->|
[ ---- tree --- **************************** index ]

then a seek to the start to read the first block (after padding)
                          |<-seek-----------------|
[ ---- tree --- _padding_ blk0 ************* index ]

then a seek to the last block and read it
                              ---seek->|
[ ---- tree --- _padding_ blk0 ******** blkN index ]

=== new routine ===

For the new routine the process is simplier and starts the same with
reading the tree:
             fp index
                |
[ ---- tree --- ********************************** ]

Then the first block is found (just the magic)
                          |
[ ---- tree --- _padding_ ************************ ]

Then the file is searched for a block index (and read):
                          |-------------- seek --->|
[ ---- tree --- _padding_ *******************index ]

and that's it.

One possible (and very unlikey) downside is if a file somehow has
- a valid block index
- a valid first block offset
- an invalid last block offset

we would not catch the error (but we would before). It seems impossible
to generate such a file without manually modifying block headers. If
that's happening there is no reason non first/last blocks might also
be incorrect so the old routine was also not absolute.
@braingram braingram force-pushed the change_index_verification branch from 6e9c45d to 7ffbed1 Compare June 1, 2026 20:04
Comment thread asdf/_block/reader.py
return blocks

# skip magic for each block
fd.seek(starting_offset)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seek is here only to keep sphinx-asdf from crashing for asdf directives that show blocks. Once #2057 is merged we can remove this seek.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant