Change block index verification by braingram · Pull Request #2056 · asdf-format/asdf

braingram · 2026-06-01T15:06:51Z

Prior to this PR the first and last blocks were loaded.

Now, prior to the index finding the start of the first block is found
The first blocks typically starts immediately after the tree.
This is nice because the code just finished reading the tree meaning
checking for the first block should usually only take reading 4 bytes
to read the block magic. The exception is when the tree is padded.
Here the entire padding needs to be read.

Now that the code knows where the first block starts, it uses
that offset to check the block index (if one is found). If the
first index doesn't match, the code falls back to reading serially.
If it matches, the code uses the block index.

This access pattern should be more efficient (and a better match
to cloud access) where the old code did something like the following.

=== old routine ===

Assume we have a file with:
[ ---- tree --- padding blk0 blk1 ... blkN index ]

For old and new reading starts with the tree
fp index
|
[ ---- tree --- ********************************** ]

For old next the index was found which required a seek to the end
then reading backwards to find the index
|----------------------- seek --->|
[ ---- tree --- **************************** index ]

then a seek to the start to read the first block (after padding)
|<-seek-----------------|
[ ---- tree --- padding blk0 ************* index ]

then a seek to the last block and read it
---seek->|
[ ---- tree --- padding blk0 ******** blkN index ]

=== new routine ===

For the new routine the process is simplier and starts the same with
reading the tree:
fp index
|
[ ---- tree --- ********************************** ]

Then the first block is found (just the magic)
|
[ ---- tree --- padding ************************ ]

Then the file is searched for a block index (and read):
|-------------- seek --->|
[ ---- tree --- padding *******************index ]

and that's it.

One possible (and very unlikey) downside is if a file somehow has

a valid block index
a valid first block offset
an invalid last block offset

we would not catch the error (but we would before). It seems impossible
to generate such a file without manually modifying block headers. If
that's happening there is no reason non first/last blocks might also
be incorrect so the old routine was also not absolute.

Tasks

run prek on your machine
run pytest on your machine
Does this PR add new features and / or change user-facing code / API? (if not, label with no-changelog-entry-needed)
- write news fragment(s) in changes/: echo "changed something" > changes/<PR#>.<changetype>.rst (see below for change types)
- update relevant docstrings and / or docs/ page
- for any new features, add unit tests

news fragment change types...

changes/<PR#>.feature.rst: new feature
changes/<PR#>.bugfix.rst: bug fix
changes/<PR#>.doc.rst: documentation change
changes/<PR#>.removal.rst: deprecation or removal of public API
changes/<PR#>.general.rst: infrastructure or miscellaneous change

codspeed-hq · 2026-06-01T15:14:15Z

Merging this PR will not alter performance

✅ 122 untouched benchmarks

_{Comparing braingram:change_index_verification (1a41fe0) with main (b763db8)}

Prior to this commit the first and last blocks were loaded. Now, prior to the index finding the start of the first block is found The first blocks typically starts immediately after the tree. This is nice because the code just finished reading the tree meaning checking for the first block should usually only take reading 4 bytes to read the block magic. The exception is when the tree is padded. Here the entire padding needs to be read. Now that the code knows where the first block starts, it uses that offset to check the block index (if one is found). If the first index doesn't match, the code falls back to reading serially. If it matches, the code uses the block index. This access pattern should be more efficient (and a better match to cloud access) where the old code did something like the following. === old routine === Assume we have a file with: [ ---- tree --- _padding_ blk0 blk1 ... blkN index ] For old and new reading starts with the tree fp index | [ ---- tree --- ********************************** ] For old next the index was found which required a seek to the end then reading backwards to find the index |----------------------- seek --->| [ ---- tree --- **************************** index ] then a seek to the start to read the first block (after padding) |<-seek-----------------| [ ---- tree --- _padding_ blk0 ************* index ] then a seek to the last block and read it ---seek->| [ ---- tree --- _padding_ blk0 ******** blkN index ] === new routine === For the new routine the process is simplier and starts the same with reading the tree: fp index | [ ---- tree --- ********************************** ] Then the first block is found (just the magic) | [ ---- tree --- _padding_ ************************ ] Then the file is searched for a block index (and read): |-------------- seek --->| [ ---- tree --- _padding_ *******************index ] and that's it. One possible (and very unlikey) downside is if a file somehow has - a valid block index - a valid first block offset - an invalid last block offset we would not catch the error (but we would before). It seems impossible to generate such a file without manually modifying block headers. If that's happening there is no reason non first/last blocks might also be incorrect so the old routine was also not absolute.

sydduckworth

There's a chunk of code not being hit by unit tests, which is the logic described in #2062 that emits a warning if there's an empty block index but no blocks.
Should probably add tests to capture that behavior.

Otherwise LGTM

braingram · 2026-06-10T20:53:18Z

There's a chunk of code not being hit by unit tests, which is the logic described in #2062 that emits a warning if there's an empty block index but no blocks. Should probably add tests to capture that behavior.

Otherwise LGTM

Thanks! Good eye finding the uncovered code. I pushed a change in 2e43751 that:

makes the warning within that block consistent with the other warnings
updates the test_invalid_block_index test to cover that code (I noticed this test wasn't truncating the file after trashing the index and since truncation is a mess on windows I updated the test to use an in-memory buffer instead of writing to disk)

The updated test was causing issues with pyrefly. Part of this seemed to be an incorrect type for the return value of read_block_index so I updated that to be list[int] instead of list[int | None].

sydduckworth · 2026-06-10T21:31:42Z

The updated test was causing issues with pyrefly. Part of this seemed to be an incorrect type for the return value of read_block_index so I updated that to be list[int] instead of list[int | None].

Ah yeah I think I set the typing of that function because when constructing the block index the values can be None, and I hadn't yet found the logic that prevents writing out a block index that contains None values.

braingram added Downstream CI Benchmarks labels Jun 1, 2026

braingram force-pushed the change_index_verification branch 3 times, most recently from 6e9c45d to 7ffbed1 Compare June 1, 2026 20:04

braingram commented Jun 1, 2026

View reviewed changes

Comment thread asdf/_block/reader.py Outdated

braingram force-pushed the change_index_verification branch 2 times, most recently from fea3c41 to bde3686 Compare June 4, 2026 21:29

braingram marked this pull request as ready for review June 4, 2026 22:07

braingram requested a review from a team as a code owner June 4, 2026 22:07

braingram mentioned this pull request Jun 5, 2026

Reading file with block index containing no blocks raises exception #2062

Open

braingram added 5 commits June 8, 2026 14:27

update tests for new index reading

8cc8226

spelling

4908b6f

add changelog

8c0ebb4

appease pyrefly

257433f

braingram force-pushed the change_index_verification branch from bde3686 to 257433f Compare June 8, 2026 18:27

Merge branch 'main' into change_index_verification

232a614

sydduckworth reviewed Jun 10, 2026

View reviewed changes

braingram added 2 commits June 10, 2026 16:40

make warning consistent, update test for coverage

2e43751

appease type checker

1a41fe0

sydduckworth approved these changes Jun 10, 2026

View reviewed changes

braingram merged commit 9c89dc2 into asdf-format:main Jun 10, 2026
65 of 66 checks passed

braingram deleted the change_index_verification branch June 10, 2026 21:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Change block index verification#2056

Change block index verification#2056
braingram merged 8 commits into
asdf-format:mainfrom
braingram:change_index_verification

braingram commented Jun 1, 2026

Uh oh!

codspeed-hq Bot commented Jun 1, 2026 •

edited

Loading

Uh oh!

Uh oh!

sydduckworth left a comment

Uh oh!

braingram commented Jun 10, 2026

Uh oh!

sydduckworth commented Jun 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

braingram commented Jun 1, 2026

Tasks

Uh oh!

codspeed-hq Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merging this PR will not alter performance

Uh oh!

Uh oh!

sydduckworth left a comment

Choose a reason for hiding this comment

Uh oh!

braingram commented Jun 10, 2026

Uh oh!

sydduckworth commented Jun 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codspeed-hq Bot commented Jun 1, 2026 •

edited

Loading