[manager] write_location_manager callback failed by MMeecatfish · Pull Request #71 · alibaba/tair-kvcache

MMeecatfish · 2026-03-24T08:32:32Z

No description provided.

qoderai

👋 Review Summary

This PR improves how write timeouts are handled by propagating WriteLocationInfo through the callback, wiring it into CacheManager::FinishWriteCache, and tightening the tests around expiration and cleanup paths. The changes overall move the write timeout behavior in a safer and more observable direction.

🛡️ Key Risks & Issues

Write timeout callbacks in CacheManager::StartWriteCache now call FinishWriteCache with a non-null WriteLocationInfo instead of relying on GetAndDelete. This fixes the prior issue where expired sessions could not be finished, but it’s a behavior change: timeout now actively processes all blocks with succeed_block = 0. It would be good to double-check this aligns with upper-layer expectations for how timed-out writes should be reconciled and reclaimed, especially if callers previously assumed timeouts did not trigger delete work.
The new callback signature in WriteLocationManager looks structurally sound and improves ownership clarity by passing a unique_ptr<WriteLocationInfo> out of the internal map after erasure. I did not see new concurrency or lifetime hazards introduced by this pattern; the callbacks are invoked after the map unlocks, with exclusive ownership of the metadata.
In WriteLocationManagerTest.MultiThreadTest, the callback currently captures keys and ids by value while also passing them by std::move into Put, and it calls GetAndDelete("session_1", dummy) for every worker. This makes the assertions rely on unspecified evaluation order and only exercises a single hard-coded session in the cleanup call. I’ve left an inline comment with concrete suggestions to make this test deterministic and properly per-session.

🧪 Verification Advice

Add or extend a test that explicitly covers the timeout callback path in CacheManager::StartWriteCache → FinishWriteCache, ensuring that when the session expires and the callback fires, the passed WriteLocationInfo is used correctly and the reclaimer work is scheduled as intended.
Consider adding a scenario that uses a non-zero success mask at timeout (e.g., some blocks already written) to validate that FinishWriteCache’s block-level accounting and subsequent reclaim behavior remain correct under partial success.
If feasible in the existing test harness, add a small test or assertion around the new logging branches in ReclaimerTaskSupervisor::Submit to confirm that failures from schedule_plan_executor_->Submit are observable in logs during negative-path testing.

💡 Thoughts & Suggestions

The move to passing WriteLocationInfo via unique_ptr is a nice step toward making the ownership model explicit and avoiding double-consumption of session data. Once the MultiThreadTest lambda is cleaned up, these tests should give you a solid base for future evolutions of the expiration/cleanup mechanism.
For future maintenance, it might be worth centralizing the logic that decides what success_block_mask should be in timeout vs. normal completion paths, so the semantics stay consistent if additional write states or metrics are introduced later.

🤖 Generated by Qoder • View workflow run

qoderai · 2026-03-24T09:50:46Z

kv_cache_manager/manager/test/write_location_manager_test.cc

+                this->manager_.Put(session_id,
+                                   std::move(keys),
+                                   std::move(ids),
+                                   2,
+                                   [this, keys, ids, session_id](WriteLocationInfoPtr info) {


In MultiThreadTest, the lambda captures keys and ids by value while the same variables are passed as std::move(keys) / std::move(ids) into Put on lines 71–74. Because function argument evaluation order is unspecified, some compilers may evaluate the captures after the move, so the captured keys/ids end up empty and the ASSERT_EQ(keys, info->keys) / ASSERT_EQ(ids, info->location_ids) checks rely on unspecified behavior. It would be safer to either avoid std::move here or build separate vectors solely for the assertions.

Also, inside the same callback you call GetAndDelete("session_1", dummy) regardless of which session_id was used for this worker. That means this test is not actually validating per-session cleanup for the current worker and could hide issues with how different sessions are managed. Using the session_id parameter instead of a hard-coded "session_1" would better exercise the intended multi-session behavior.

🤖 Generated by Qoder • Fix in Qoder

[manager] write_location_manager callback failed

5b1ee81

qoderai bot reviewed Mar 24, 2026

View reviewed changes

github-actions bot added the ai reviewed AI has reviewed this PR label Mar 24, 2026

MMeecatfish merged commit aedeb8b into main Mar 24, 2026
9 of 10 checks passed

MMeecatfish deleted the fix/write_location_manager_callback_failed branch March 24, 2026 11:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[manager] write_location_manager callback failed#71

[manager] write_location_manager callback failed#71
MMeecatfish merged 1 commit intomainfrom
fix/write_location_manager_callback_failed

MMeecatfish commented Mar 24, 2026

Uh oh!

qoderai bot left a comment

Uh oh!

qoderai bot Mar 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

MMeecatfish commented Mar 24, 2026

Uh oh!

qoderai bot left a comment

Choose a reason for hiding this comment

👋 Review Summary

🛡️ Key Risks & Issues

🧪 Verification Advice

💡 Thoughts & Suggestions

Uh oh!

qoderai bot Mar 24, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant