
Optimize sharded safetensors metadata parsing: 3 HTTP requests → 2 per shard #1979

Open
mishig25 wants to merge 2 commits into main from optimize-sharded-safetensors-fetch


@mishig25 (Collaborator) commented Feb 16, 2026

Summary

  • Bypass downloadFile/fileDownloadInfo when fetching shard headers, using direct range requests instead
  • Reduces HTTP round-trips from 3 to 2 per shard (8 bytes for header length, then exact header content)
  • Rejects non-206 responses to avoid downloading full multi-GB shard bodies into memory
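
The two-request flow in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code changed in this PR: `FetchLike`, `fetchRange`, `fetchShardHeader`, and the `MAX_HEADER_BYTES` cap are hypothetical names and values chosen for the example. It relies only on the safetensors file layout (an 8-byte little-endian length followed by that many bytes of JSON) and standard HTTP range requests.

```typescript
// Illustrative sketch only: FetchLike, fetchRange and fetchShardHeader are
// hypothetical names, not the functions changed in this PR.
type FetchLike = (
  url: string,
  init?: { headers?: Record<string, string> }
) => Promise<Response>;

// Assumed sanity cap on the JSON header size; the real limit may differ.
const MAX_HEADER_BYTES = 25000000;

async function fetchRange(
  fetchFn: FetchLike,
  url: string,
  start: number,
  endInclusive: number,
  headers: Record<string, string> = {}
): Promise<ArrayBuffer> {
  const resp = await fetchFn(url, {
    headers: { ...headers, Range: `bytes=${start}-${endInclusive}` },
  });
  // Anything other than 206 means the range was not honored, so the body
  // is never read here and a full shard is never buffered.
  if (resp.status !== 206) {
    throw new Error(`Expected 206 Partial Content, got ${resp.status}`);
  }
  return resp.arrayBuffer();
}

async function fetchShardHeader(
  fetchFn: FetchLike,
  url: string,
  headers: Record<string, string> = {}
): Promise<Record<string, unknown>> {
  // Request 1: the first 8 bytes hold the header length (little-endian u64).
  const lengthBuf = await fetchRange(fetchFn, url, 0, 7, headers);
  if (lengthBuf.byteLength !== 8) {
    throw new Error("Truncated header length");
  }
  const view = new DataView(lengthBuf);
  // Combine the two 32-bit halves; exact for any length below 2^53.
  const headerLength = view.getUint32(4, true) * 2 ** 32 + view.getUint32(0, true);
  if (headerLength <= 0 || headerLength > MAX_HEADER_BYTES) {
    throw new Error(`Implausible header length: ${headerLength}`);
  }
  // Request 2: exactly the JSON header bytes that follow the length prefix.
  const headerBuf = await fetchRange(fetchFn, url, 8, 8 + headerLength - 1, headers);
  return JSON.parse(new TextDecoder().decode(headerBuf)) as Record<string, unknown>;
}
```

Taking the fetch function as a parameter keeps the sketch testable without network access. In a runtime with a global `fetch`, it could be called as `fetchShardHeader(fetch, shardUrl, { Authorization: "Bearer <token>" })`, where `shardUrl` and the token are placeholders.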

Benchmarks (avg of 10 runs each)

| Model (shards) | Old (3 req/shard) | Optimized (2 req/shard) | Change |
|---|---|---|---|
| bloom (72) | 3,078ms | 4,696ms* | outlier skew* |
| sharded file path (4) | 996ms | 1,125ms | ~same |
| sharded metadata | 2,539ms | 2,179ms | 14% faster |
| gpt-oss-20b (3) | 1,058ms | 1,079ms | ~same |
| Kimi-K2.5 (64) | 3,602ms | 2,385ms | 34% faster |
| DeepSeek-Math-V2 (163) | 5,006ms (FAILED 10/10) | 3,759ms | Fixed |
| Qwen3.5-397B (94) | 2,790ms (1/10 fail) | 2,058ms | 26% faster |

* bloom optimized avg skewed by one 10.8s outlier run (min was 1,581ms vs old min 2,967ms)

Key finding: DeepSeek-Math-V2 (163 shards) fails 100% with old code — 489 HTTP requests (163 × 3) overwhelm the server. The optimized path (163 × 2 = 326 requests) handles it reliably.

Test plan

  • All 17 existing + new tests pass (10/10 runs)
  • Verify with private/gated repos (auth header passthrough)

🤖 Generated with Claude Code

Previously, each shard required 3 HTTP requests (fileDownloadInfo + header
length + header content). This replaces that with 2 direct range requests
(8 bytes for length, then exact header bytes), bypassing downloadFile/
fileDownloadInfo entirely for sharded models.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mishig25 mishig25 requested a review from coyotte508 as a code owner February 16, 2026 11:49
@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


…ll shard bodies

Cancel the response body and throw a clear error when the server returns
200 instead of 206, which would otherwise attempt to load multi-GB shard
files into memory.
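
A guard of the kind this commit message describes might look like the sketch below. The function name `requirePartialContent` and the error wording are assumptions made for illustration; the PR's actual implementation is not shown on this page. It uses only the standard `Response.status` and `ReadableStream.cancel()` APIs.

```typescript
// Illustrative sketch only: requirePartialContent is a hypothetical helper,
// not the function added in this PR.
async function requirePartialContent(resp: Response, url: string): Promise<Response> {
  if (resp.status === 206) {
    return resp;
  }
  // A 200 here usually means the server ignored the Range header and is
  // about to stream the whole shard. Cancel the body so the transfer can
  // stop instead of being read into memory.
  await resp.body?.cancel();
  throw new Error(
    `Expected 206 Partial Content for ${url}, got ${resp.status}; refusing to read the full body`
  );
}
```

Calling the guard before `arrayBuffer()` means the error path never touches the response body.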

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mishig25 mishig25 requested a review from gary149 February 16, 2026 11:54
@coyotte508 (Member) left a comment


note this means that we're downloading from the xet bridge, instead of downloading from xet backend directly (xet-read-token + reconstructionInfo + cas requests)

maybe faster because of the CDN in front of the bridge? or the bridge being closer to the xet backend? idk

anyway up to you (whether to merge) - cc @XciD for fiz
