Fix NullReferenceException in object store scan during slot migration when tombstones are present #1596

Draft
Copilot wants to merge 3 commits into main from copilot/fix-slot-migration-failure

Copilot AI commented Mar 3, 2026

Automatic slot migration crashes with NullReferenceException in ObjectStoreScan.SingleReader when object store keys with TTLs or explicit deletes create tombstone records during migration. The tombstone's IGarnetObject value is null, and ClusterSession.Expired(ref value) dereferences it unconditionally.

Root cause

MigrateOperation.Scan passed includeTombstones: true to both IterateMainStore and IterateObjectStore. Object store tombstones have value = null (set by ConcurrentDeleter / default-initialized in CreateNewRecordDelete). When the scan handed these records to ObjectStoreScan.SingleReader, the crash occurred:

NullReferenceException at ClusterSession.Expired(IGarnetObject& value)
  at MigrateSession.ObjectStoreScan.SingleReader(...)

Tombstones were never actually migrated — they would have been skipped as NOTFOUND during transmission — so includeTombstones: true had no effect other than causing crashes.
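
The failure mode described above can be modeled in isolation. The following is a minimal sketch, not Garnet's actual code: `IStoreObject`, `LogRecord`, `ScanSketch`, and the expiry convention (0 meaning "never expires") are hypothetical stand-ins chosen for illustration. It only shows how an unconditional dereference in an expiry check fails once a scan starts handing over tombstones whose value is null.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for an object store value; NOT Garnet's IGarnetObject.
public interface IStoreObject
{
    long Expiration { get; } // assumption for this sketch: 0 means "never expires"
}

public sealed class SortedSetStub : IStoreObject
{
    public SortedSetStub(long expiration) { Expiration = expiration; }
    public long Expiration { get; }
}

// Hypothetical log record: a tombstone carries a null Value.
public sealed class LogRecord
{
    public LogRecord(string key, IStoreObject value, bool isTombstone)
    {
        Key = key;
        Value = value;
        IsTombstone = isTombstone;
    }

    public string Key { get; }
    public IStoreObject Value { get; }
    public bool IsTombstone { get; }
}

public static class ScanSketch
{
    // Dereferences value unconditionally, like the check named in the stack trace.
    public static bool Expired(IStoreObject value, long now)
    {
        return value.Expiration != 0 && value.Expiration < now;
    }

    // Collects the keys a scan would hand to migration.
    public static List<string> CollectKeys(IEnumerable<LogRecord> log, bool includeTombstones, long now)
    {
        var keys = new List<string>();
        foreach (var record in log)
        {
            // With includeTombstones == false the scan never surfaces deleted records.
            if (record.IsTombstone && !includeTombstones)
                continue;

            // For a tombstone, record.Value is null and this call throws
            // NullReferenceException.
            if (Expired(record.Value, now))
                continue;

            keys.Add(record.Key);
        }
        return keys;
    }
}
```

In this model, calling `CollectKeys` with `includeTombstones: true` on a log that contains a tombstone throws, while `includeTombstones: false` returns only the live keys. An alternative would be a null guard inside the expiry check, but the PR description argues that tombstones carry nothing to migrate, so excluding them at the scan is the simpler choice.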

Changes

  • libs/cluster/Server/Migration/MigrateOperation.cs — Remove includeTombstones: true from both IterateMainStore and IterateObjectStore calls in Scan(), reverting to the default false. Deleted keys have no data to migrate; excluding tombstones is both correct and safe.

  • test/Garnet.test.cluster/ClusterMigrateTests.cs — Add regression test ClusterMigrateSlotsWithObjectTombstones. The test first writes string keys to the source node's main store. This step is required because the object store scan range is bounded by storeTailAddress (the main store's tail address), so the tombstone must fall within that range to trigger the bug. The test then creates a sorted set, deletes it to produce an object store tombstone, migrates the slot, and asserts that the live key arrives on the target.
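
The precondition in the regression test (filling the main store before creating the tombstone) can be illustrated with a small model. This is a sketch under stated assumptions, not Garnet's or Tsavorite's implementation: `AddressedRecord`, `ScanRangeSketch`, the concrete addresses, and the exclusive upper bound are all hypothetical. It only shows that a scan bounded by some tail address never visits a tombstone located at or beyond that address.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical model of a record at a log address; not Tsavorite's actual layout.
public sealed class AddressedRecord
{
    public AddressedRecord(long address, string key, bool isTombstone)
    {
        Address = address;
        Key = key;
        IsTombstone = isTombstone;
    }

    public long Address { get; }
    public string Key { get; }
    public bool IsTombstone { get; }
}

public static class ScanRangeSketch
{
    // Counts the tombstones a scan would visit when it only covers addresses
    // below untilAddress (treating the bound as exclusive is an assumption).
    public static int TombstonesVisited(IEnumerable<AddressedRecord> objectLog, long untilAddress)
    {
        int visited = 0;
        foreach (var record in objectLog)
        {
            if (record.Address >= untilAddress)
                continue; // outside the scanned range, never handed to the reader

            if (record.IsTombstone)
                visited++;
        }
        return visited;
    }
}
```

With a tombstone at address 128, a bound of 64 visits no tombstones, while a larger bound such as 4096 visits one. This matches the reasoning given for the test: growing the main store's tail first is what brings the object store tombstone inside the scanned range.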

Original prompt

This section details on the original issue you should resolve

<issue_title>Automatic Slot Migrator Failure (System.NullReferenceException in CreateAndRunMigrateTasks and MigrateSession.RecoverFromFailure failed to make slots STABLE)</issue_title>
<issue_description>### Describe the bug

When running garnet, the Automatic Slot Migration fails.

Cluster: A two-node cluster (2 shards, no replicas). The cluster is initialized with all slots on one node holding 5-10 thousand keys, and then we try to migrate some hash slots from one primary/shard to another while continuously writing data and deleting keys (both manually and using TTLs). Some short-TTL keys are consistently being deleted and modified.

Garnet Version: 1.0.94

Network: IPv6 with TLS

### Steps to reproduce the bug

Command:

/usr/bin/valkey-cli -h primary-01.foo.bar -p 6379 --user myuser -a mypassword --tls --cacert /path/to/ca.crt MIGRATE ipv6-address-of-primary-02 6379 "" 0 -1 REPLACE AUTH2 myuser 'mypassword' SLOTSRANGE 8192 16383

When the above command is executed, the sender reports that it is migrating slots to the target, but it throws an error in the logs, and cluster_slots_ok and cluster_slots_assigned both decrease (by the number of slots being migrated).

CLUSTER MTASKS reports 0. It should report 1 during the migration.

Sender's Logs:

2026-02-13T06:24:04.987917+00:00 primary-02 GarnetServer[216135]: 06::24::04 fail: MigrateSession - 14333193[0] CreateAndRunMigrateTasks: Object 24 240 4096 System.NullReferenceException: Object reference not set to an instance of an object.
   at Garnet.cluster.ClusterSession.Expired(IGarnetObject& value) in /_/libs/cluster/Session/MigrateCommand.cs:line 18
   at Garnet.cluster.MigrateSession.ObjectStoreScan.SingleReader(Byte[]& key, IGarnetObject& value, RecordMetadata recordMetadata, Int64 numberOfRecords, CursorRecordResult& cursorRecordResult) in /_/libs/cluster/Server/Migration/MigrateScanFunctions.cs:line 78
   at Tsavorite.core.AllocatorBase`4.ScanLookup[TInput,TOutput,TScanFunctions,TScanIterator](TsavoriteKV`4 store, ScanCursorState`2 scanCursorState, Int64& cursor, Int64 count, TScanFunctions scanFunctions, TScanIterator iter, Boolean validateCursor, Int64 maxAddress, Boolean resetCursor, Boolean includeTombstones) in /_/libs/storage/Tsavorite/cs/src/core/Allocator/AllocatorScan.cs:line 197
   at Tsavorite.core.GenericAllocatorImpl`3.ScanCursor[TScanFunctions](TsavoriteKV`4 store, ScanCursorState`2 scanCursorState, Int64& cursor, Int64 count, TScanFunctions scanFunctions, Int64 endAddress, Boolean validateCursor, Int64 maxAddress, Boolean resetCursor, Boolean includeTombstones) in /_/libs/storage/Tsavorite/cs/src/core/Allocator/GenericAllocatorImpl.cs:line 1034
   at Tsavorite.core.ClientSession`8.ScanCursor[TScanFunctions](Int64& cursor, Int64 count, TScanFunctions scanFunctions, Int64 endAddress, Boolean validateCursor, Int64 maxAddress, Boolean resetCursor, Boolean includeTombstones) in /_/libs/storage/Tsavorite/cs/src/core/ClientSession/ClientSession.cs:line 503
   at Tsavorite.core.ClientSession`8.IterateLookup[TScanFunctions](TScanFunctions& scanFunctions, Int64& cursor, Int64 untilAddress, Boolean validateCursor, Int64 maxAddress, Boolean resetCursor, Boolean includeTombstones) in /_/libs/storage/Tsavorite/cs/src/core/ClientSession/ClientSession.cs:line 477
   at Garnet.server.StorageSession.IterateObjectStore[TScanFunctions](TScanFunctions& scanFunctions, Int64& cursor, Int64 untilAddress, Int64 maxAddress, Boolean validateCursor, Boolean includeTombstones) in /_/libs/server/Storage/Session/Common/ArrayKeyIterationFunctions.cs:line 172
   at Garnet.server.GarnetApi`2.IterateObjectStore[TScanFunctions](TScanFunctions& scanFunctions, Int64& cursor, Int64 untilAddress, Int64 maxAddress, Boolean includeTombstones) in /_/libs/server/API/GarnetApi.cs:line 463
   at Garnet.cluster.MigrateSession.MigrateOperation.Scan(StoreType storeType, Int64& currentAddress, Int64 endAddress) in /_/libs/cluster/Server/Migration/MigrateOperation.cs:line 67
   at Garnet.cluster.MigrateSession.<>c__DisplayClass61_0.<MigrateSlotsDriverInline>g__ScanStoreTask|1(Int32 taskId, StoreType storeType, Int64 beginAddress, Int64 tailAddress, Int32 pageSize) in /_/libs/cluster/Server/Migration/MigrateSessionSlots.cs:line 93
   at Garnet.cluster.MigrateSession.<>c__DisplayClass61_2.<MigrateSlotsDriverInline>b__2() in /_/libs/cluster/Server/Migration/MigrateSessionSlots.cs:line 57
   at System.Threading.Tasks.Task`1.InnerInvoke()
   at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location ---
   at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, Execu...

</details>



- Fixes microsoft/garnet#1560


Copilot AI and others added 2 commits March 3, 2026 21:01
…ombstones

Remove includeTombstones: true from MigrateOperation.Scan. Tombstone records
in the object store have null IGarnetObject values, which caused a crash in
ObjectStoreScan.SingleReader when passed to ClusterSession.Expired(ref value).
Tombstones represent deleted keys with no data to migrate, so excluding them
is both correct and safe. Add regression test ClusterMigrateSlotsWithObjectTombstones.

Co-authored-by: vazois <96085550+vazois@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix automatic slot migrator failure in garnet" to "Fix NullReferenceException in object store scan during slot migration when tombstones are present" on Mar 3, 2026