Skip to content

Change block index verification#2056

Merged
braingram merged 8 commits into
asdf-format:mainfrom
braingram:change_index_verification
Jun 10, 2026
Merged

Change block index verification#2056
braingram merged 8 commits into
asdf-format:mainfrom
braingram:change_index_verification

Conversation

@braingram

Copy link
Copy Markdown
Contributor

Prior to this PR the first and last blocks were loaded.

Now, prior to the index finding the start of the first block is found
The first blocks typically starts immediately after the tree.
This is nice because the code just finished reading the tree meaning
checking for the first block should usually only take reading 4 bytes
to read the block magic. The exception is when the tree is padded.
Here the entire padding needs to be read.

Now that the code knows where the first block starts, it uses
that offset to check the block index (if one is found). If the
first index doesn't match, the code falls back to reading serially.
If it matches, the code uses the block index.

This access pattern should be more efficient (and a better match
to cloud access) where the old code did something like the following.

=== old routine ===

Assume we have a file with:
[ ---- tree --- padding blk0 blk1 ... blkN index ]

For old and new reading starts with the tree
fp index
|
[ ---- tree --- ********************************** ]

For old next the index was found which required a seek to the end
then reading backwards to find the index
|----------------------- seek --->|
[ ---- tree --- **************************** index ]

then a seek to the start to read the first block (after padding)
|<-seek-----------------|
[ ---- tree --- padding blk0 ************* index ]

then a seek to the last block and read it
---seek->|
[ ---- tree --- padding blk0 ******** blkN index ]

=== new routine ===

For the new routine the process is simplier and starts the same with
reading the tree:
fp index
|
[ ---- tree --- ********************************** ]

Then the first block is found (just the magic)
|
[ ---- tree --- padding ************************ ]

Then the file is searched for a block index (and read):
|-------------- seek --->|
[ ---- tree --- padding *******************index ]

and that's it.

One possible (and very unlikey) downside is if a file somehow has

  • a valid block index
  • a valid first block offset
  • an invalid last block offset

we would not catch the error (but we would before). It seems impossible
to generate such a file without manually modifying block headers. If
that's happening there is no reason non first/last blocks might also
be incorrect so the old routine was also not absolute.

Tasks

  • run prek on your machine
  • run pytest on your machine
  • Does this PR add new features and / or change user-facing code / API? (if not, label with no-changelog-entry-needed)
    • write news fragment(s) in changes/: echo "changed something" > changes/<PR#>.<changetype>.rst (see below for change types)
    • update relevant docstrings and / or docs/ page
    • for any new features, add unit tests
news fragment change types...
  • changes/<PR#>.feature.rst: new feature
  • changes/<PR#>.bugfix.rst: bug fix
  • changes/<PR#>.doc.rst: documentation change
  • changes/<PR#>.removal.rst: deprecation or removal of public API
  • changes/<PR#>.general.rst: infrastructure or miscellaneous change

@codspeed-hq

codspeed-hq Bot commented Jun 1, 2026

Copy link
Copy Markdown

Merging this PR will not alter performance

✅ 122 untouched benchmarks


Comparing braingram:change_index_verification (1a41fe0) with main (b763db8)

Open in CodSpeed

@braingram braingram force-pushed the change_index_verification branch 3 times, most recently from 6e9c45d to 7ffbed1 Compare June 1, 2026 20:04
Comment thread asdf/_block/reader.py Outdated
@braingram braingram force-pushed the change_index_verification branch 2 times, most recently from fea3c41 to bde3686 Compare June 4, 2026 21:29
@braingram braingram marked this pull request as ready for review June 4, 2026 22:07
@braingram braingram requested a review from a team as a code owner June 4, 2026 22:07
braingram added 5 commits June 8, 2026 14:27
Prior to this commit the first and last blocks were loaded.

Now, prior to the index finding the start of the first block is found
The first blocks typically starts immediately after the tree.
This is nice because the code just finished reading the tree meaning
checking for the first block should usually only take reading 4 bytes
to read the block magic. The exception is when the tree is padded.
Here the entire padding needs to be read.

Now that the code knows where the first block starts, it uses
that offset to check the block index (if one is found). If the
first index doesn't match, the code falls back to reading serially.
If it matches, the code uses the block index.

This access pattern should be more efficient (and a better match
to cloud access) where the old code did something like the following.

=== old routine ===

Assume we have a file with:
[ ---- tree --- _padding_ blk0 blk1 ... blkN index ]

For old and new reading starts with the tree
             fp index
                |
[ ---- tree --- ********************************** ]

For old next the index was found which required a seek to the end
then reading backwards to find the index
                |----------------------- seek --->|
[ ---- tree --- **************************** index ]

then a seek to the start to read the first block (after padding)
                          |<-seek-----------------|
[ ---- tree --- _padding_ blk0 ************* index ]

then a seek to the last block and read it
                              ---seek->|
[ ---- tree --- _padding_ blk0 ******** blkN index ]

=== new routine ===

For the new routine the process is simplier and starts the same with
reading the tree:
             fp index
                |
[ ---- tree --- ********************************** ]

Then the first block is found (just the magic)
                          |
[ ---- tree --- _padding_ ************************ ]

Then the file is searched for a block index (and read):
                          |-------------- seek --->|
[ ---- tree --- _padding_ *******************index ]

and that's it.

One possible (and very unlikey) downside is if a file somehow has
- a valid block index
- a valid first block offset
- an invalid last block offset

we would not catch the error (but we would before). It seems impossible
to generate such a file without manually modifying block headers. If
that's happening there is no reason non first/last blocks might also
be incorrect so the old routine was also not absolute.
@braingram braingram force-pushed the change_index_verification branch from bde3686 to 257433f Compare June 8, 2026 18:27

@sydduckworth sydduckworth left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a chunk of code not being hit by unit tests, which is the logic described in #2062 that emits a warning if there's an empty block index but no blocks.
Should probably add tests to capture that behavior.

Otherwise LGTM

@braingram

Copy link
Copy Markdown
Contributor Author

There's a chunk of code not being hit by unit tests, which is the logic described in #2062 that emits a warning if there's an empty block index but no blocks. Should probably add tests to capture that behavior.

Otherwise LGTM

Thanks! Good eye finding the uncovered code. I pushed a change in 2e43751 that:

  • makes the warning within that block consistent with the other warnings
  • updates the test_invalid_block_index test to cover that code (I noticed this test wasn't truncating the file after trashing the index and since truncation is a mess on windows I updated the test to use an in-memory buffer instead of writing to disk)

The updated test was causing issues with pyrefly. Part of this seemed to be an incorrect type for the return value of read_block_index so I updated that to be list[int] instead of list[int | None].

@sydduckworth

Copy link
Copy Markdown
Member

The updated test was causing issues with pyrefly. Part of this seemed to be an incorrect type for the return value of read_block_index so I updated that to be list[int] instead of list[int | None].

Ah yeah I think I set the typing of that function because when constructing the block index the values can be None, and I hadn't yet found the logic that prevents writing out a block index that contains None values.

@braingram braingram merged commit 9c89dc2 into asdf-format:main Jun 10, 2026
65 of 66 checks passed
@braingram braingram deleted the change_index_verification branch June 10, 2026 21:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants