Index routing currently calls load_shard() for every shard just to read centroids, so the first query deserializes the full index into RAM before nprobe selection. This defeats the documented lazy-loading design, inflates cold-start latency, and makes memory usage scale with full index size instead of the probed working set.
Desired direction:
- separate centroid routing metadata from full shard bodies
- choose probe shards without loading every shard payload
- load full shard bodies lazily, and only for the selected probe set
- version any manifest or binary format changes explicitly
- update docs when formats change
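
A minimal sketch of the direction above, not the actual implementation. It assumes a versioned manifest that stores one centroid per shard and a loader callable that takes a shard id; the names `ShardRouter`, `MANIFEST_VERSION`, `route`, `probed_shards`, and the manifest layout are all hypothetical, and the real `load_shard()` signature may differ. Routing reads only the manifest, and shard bodies are loaded on demand for the probe set.

```python
import heapq
import math
from typing import Any, Callable, Dict, List, Sequence

# Hypothetical version number for the routing manifest format.
MANIFEST_VERSION = 2


class ShardRouter:
    """Routes queries using centroid metadata only; loads shard bodies lazily."""

    def __init__(self, manifest: Dict[str, Any], load_shard: Callable[[str], Any]):
        # Reject manifests written in a format this code does not understand.
        if manifest.get("version") != MANIFEST_VERSION:
            raise ValueError(f"unsupported manifest version: {manifest.get('version')}")
        # Assumption: one centroid per shard, stored in the manifest itself.
        self._centroids: Dict[str, Sequence[float]] = {
            entry["shard_id"]: entry["centroid"] for entry in manifest["shards"]
        }
        self._load_shard = load_shard
        self._cache: Dict[str, Any] = {}

    def route(self, query: Sequence[float], nprobe: int) -> List[str]:
        """Return the ids of the nprobe nearest shards without loading any body."""
        return heapq.nsmallest(
            nprobe,
            self._centroids,
            key=lambda sid: math.dist(query, self._centroids[sid]),
        )

    def probed_shards(self, query: Sequence[float], nprobe: int) -> List[Any]:
        """Load (and cache) only the shard bodies selected by route()."""
        shard_ids = self.route(query, nprobe)
        for sid in shard_ids:
            if sid not in self._cache:
                self._cache[sid] = self._load_shard(sid)
        return [self._cache[sid] for sid in shard_ids]
```

Keeping the centroids in a small manifest means the cost of routing depends on the number of shards rather than on total index size, and the version check gives a clear failure mode when an older artifact is opened.
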
Acceptance criteria:
- IndexSearcher can route without load_shard() on every shard
- the first query deserializes only the probed shard bodies
- tests verify non-probed shards are not loaded during routing
- docs updated for any schema/artifact changes
- a benchmark or instrumentation confirms the cold-start improvement (first-query latency and memory usage)
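
One way the test and instrumentation criteria could be covered is a thin recording wrapper around the shard loader. This is a sketch under assumptions: `LoadRecorder` and `assert_only_probed_loaded` are hypothetical helpers, and the loader is assumed to be a callable that takes a shard id, which may not match the real `load_shard()` signature.

```python
import time
from typing import Any, Callable, Iterable, List


class LoadRecorder:
    """Wraps a shard loader and records which shards were loaded and for how long."""

    def __init__(self, load_shard: Callable[[str], Any]):
        self._load_shard = load_shard
        self.loaded_ids: List[str] = []
        self.seconds = 0.0  # cumulative wall-clock time spent inside the loader

    def __call__(self, shard_id: str) -> Any:
        start = time.perf_counter()
        try:
            return self._load_shard(shard_id)
        finally:
            # Record the call even if the loader raised.
            self.seconds += time.perf_counter() - start
            self.loaded_ids.append(shard_id)


def assert_only_probed_loaded(recorder: LoadRecorder, probed_ids: Iterable[str]) -> None:
    """Fail if any shard outside the probe set was loaded."""
    unexpected = set(recorder.loaded_ids) - set(probed_ids)
    assert not unexpected, f"non-probed shards were loaded: {sorted(unexpected)}"
```

In a test, the recorder would be injected in place of the real loader, a first query run, and `assert_only_probed_loaded` called with the expected probe set. The same `loaded_ids` and `seconds` fields can be logged before and after the change to compare cold-start behavior.
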