From 93f799ed9bf16896bef69587d39b54dc9cc51a04 Mon Sep 17 00:00:00 2001
From: ekexium
Date: Thu, 15 Aug 2024 23:09:29 +0800
Subject: [PATCH 1/8] rfc: Resolved-ts for Large Transactions

Signed-off-by: ekexium
---
 ...0114-resolved-ts-for-large-transactions.md | 105 ++++++++++++++++++
 1 file changed, 105 insertions(+)
 create mode 100644 text/0114-resolved-ts-for-large-transactions.md

diff --git a/text/0114-resolved-ts-for-large-transactions.md b/text/0114-resolved-ts-for-large-transactions.md
new file mode 100644
index 00000000..94ce0488
--- /dev/null
+++ b/text/0114-resolved-ts-for-large-transactions.md
@@ -0,0 +1,105 @@
+# Resolved-ts for Large Transactions
+
+Author: @ekexium
+
+Tracking issue: N/A
+
+## Background
+
+The RFC is a variation of @zhangjinpeng87 's [Large Transactions Don't Block Watermark](https://github.com/pingcap/tiflow/blob/master/docs/design/2024-01-22-ticdc-large-txn-not-block-wm.md). They aim to solve the same problem.
+
+Resolved-ts is a tool for other services. Its definition is that no new commit records with timestamps smaller than the resolved-ts will be observed after you observe the resolved-ts.
+
+In current TiKV(v8.3), large transactions can block resolved-ts from advancing, because it is calculated as `min(pd-tso, min(lock.ts))`, which is actually a more stringent constraint than its original definition. A lock from a pipelined txn can live for several hours. This makes services dependent on resolved-ts unavailable.
+
+## Goals
+
+Do not let **large pipelined transactions** block the advance of resolved-ts.
+
+We focus on large pipelined transactions here. It could be adapted for general "large" transactions.
+
+## Assumptions
+
+We assume that the number of concurrent pipelined transactions is bounded, not exceeding 10000, for example.
+
+This constraint is not a strict limit; rather, it serves to manage resource utilization and facilitate performance evaluation. 10000 should be large enough in the real world.
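To illustrate the problem stated above: a minimal Python model (not TiKV's actual Rust implementation; `naive_resolved_ts` and its parameter names are hypothetical) of the `min(pd-tso, min(lock.ts))` calculation shows how a single long-lived pipelined-transaction lock pins resolved-ts:

```python
# Toy model of the current calculation: resolved-ts is the minimum of the
# PD TSO and the start_ts of every outstanding lock, so one long-lived lock
# holds resolved-ts back no matter how far the TSO advances.
def naive_resolved_ts(pd_tso: int, lock_start_tss: list[int]) -> int:
    return min([pd_tso] + lock_start_tss)

# A pipelined transaction started at ts=100 and is still writing hours later.
# PD's TSO has moved far ahead, but resolved-ts stays pinned at 100.
print(naive_resolved_ts(pd_tso=5_000_000, lock_start_tss=[100]))  # 100
```

With no outstanding locks, the candidate simply follows the PD TSO, which is why only long-lived locks are a problem.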
+
+## Design
+
+The key idea is using `lock.min_commit_ts` to calculate resolved-ts instead of `lock.start_ts`.
+
+A resolved-ts guarantees that all historical events prior to this timestamp are finalized and observable. 'Historical events' in this context refer specifically to write records and rollback records, but explicitly exclude locks. It's important to note that the absence of locks with earlier timestamps is not a requirement for a valid resolved-ts, as long as the status of their corresponding transactions is definitively determined.
+
+### Maintenance of resolved-ts
+
+Key objective: Maximize TiKV nodes' awareness of large pipelined transactions, including:
+
+1. start_ts
+2. Recent min_commit_ts
+3. Status
+
+#### Coordinator
+
+For a large pipelined transaction, its TTL manager is responsible for fetching a latest TSO as a candidate of min_commit_ts and updating both the committer's inner state and the PK in TiKV. After that, it broadcasts the start_ts and the new min_commit_ts to all TiKV stores. The update of the PK can be done within the heartbeat request.
+
+Atomic variables or locks may be needed for synchronization between the TTL manager and the committer.
+
+#### TiKV scheduler - heartbeat
+
+Besides updating TTL, it can also update min_commit_ts of the PK.
+
+*TBD: should it also update max_ts?*
+
+#### TiKV txn_status_cache
+
+A standalone part was created for large transactions specially. The cache serves as
+
+1. A fresh enough source of min_commit_ts info of large transactions for resolved-ts resolver
+2. A fast path for read requests when they would otherwise return to coordinator to check PK's min_commit_ts.
+
+##### Eviction
+
+We would keep as much useful info as possible in the cache, and never evict any of them because of space issues. One entry only contains information like start_ts + min_commit_ts + status + TTL, which should make the cache small enough, considering our assumption of the number of ongoing large transactions.
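A sketch of what such a cache entry could look like (field and type names here are hypothetical; the actual TiKV structure may differ). It also illustrates one property a correct cache needs: out-of-order broadcasts must never move min_commit_ts backwards.

```python
from dataclasses import dataclass

@dataclass
class TxnStatusEntry:
    min_commit_ts: int   # freshest value received via broadcast/heartbeat
    status: str          # e.g. "ongoing", "committed", or "rolled_back"
    ttl_deadline: float  # wall-clock time after which the entry may expire

# Keyed by start_ts; under the bounded-concurrency assumption this map
# stays small, so entries are never evicted for space reasons.
cache: dict[int, TxnStatusEntry] = {}

def upsert(start_ts: int, entry: TxnStatusEntry) -> None:
    old = cache.get(start_ts)
    # Ignore stale broadcasts: min_commit_ts only ever moves forward.
    if old is None or entry.min_commit_ts > old.min_commit_ts:
        cache[start_ts] = entry

upsert(100, TxnStatusEntry(500, "ongoing", 1e12))
upsert(100, TxnStatusEntry(450, "ongoing", 1e12))  # out-of-order, ignored
print(cache[100].min_commit_ts)  # 500
```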
There should be a large default TTL for these entries, because we want to save unnecessary effort when a reader meets a lock belonging to these transactions.
+
+After successfully committing all secondary locks of a large transaction, the coordinator explicitly broadcasts a TTL update to all TiKV nodes, extending it to several seconds later. Don't immediately evict the entry, to give the follower peers some time to catch up with the leader; otherwise a stale read may encounter a lock and miss the cache.
+
+#### TiKV resolved-ts resolver
+
+The resolver tracks normal locks as usual, but handles locks belonging to large pipelined transactions in a different way. The locks can be identified via the "generation" field.
+
+For a lock belonging to a large pipelined transaction, the resolver only tracks its start_ts. When calculating resolved-ts, the resolver first tries to map the start_ts to its min_commit_ts by querying the txn_status_cache. If it is not found in the cache, it falls back to calculating with start_ts.
+
+Upon observing a LOCK DELETION, the resolver ceases tracking the corresponding start_ts for large pipelined transactions. This is justified as lock deletion only occurs once a transaction's final state is determined.
+
+### Compatibility
+
+The key difference is that services can now observe locks. They need to handle the locks.
+
+#### Stale read
+
+When a stale read meets a lock, first query the txn_status_cache. When the transaction is not found in the cache, fall back to a leader read.
+
+#### Flashback
+
+*TBD*
+
+#### EBS snapshot backups
+
+*TBD*
+
+#### CDC
+
+Already well documented in [Large Transactions Don't Block Watermark](https://github.com/pingcap/tiflow/blob/master/docs/design/2024-01-22-ticdc-large-txn-not-block-wm.md). Briefly, refactoring work is needed.
+
+### Cost
+
+Memory: each cache entry takes at least 8 (start_ts) + 8 (min_commit_ts) + 1 (status) + 8 (TTL) = 25 bytes. Any TiKV instance can easily hold millions of such entries.
Latency: maintenance of resolved-ts requires extra work, but it can be done asynchronously, thus not affecting latency.
+
+RPCs: each large transaction sends N more RPCs per second, where N is the number of TiKVs.
+
+CPU: the mechanism may consume more CPU, but the overhead should be negligible.

From 4fa64e3b1daa35a12b818d4d8087565cdb3cc2ca Mon Sep 17 00:00:00 2001
From: ekexium
Date: Fri, 23 Aug 2024 09:41:11 +0800
Subject: [PATCH 2/8] discuss normal locks

Signed-off-by: ekexium
---
 ...0114-resolved-ts-for-large-transactions.md | 77 +++++++++++++++++--
 1 file changed, 70 insertions(+), 7 deletions(-)

diff --git a/text/0114-resolved-ts-for-large-transactions.md b/text/0114-resolved-ts-for-large-transactions.md
index 94ce0488..66499806 100644
--- a/text/0114-resolved-ts-for-large-transactions.md
+++ b/text/0114-resolved-ts-for-large-transactions.md
@@ -14,9 +14,9 @@ In current TiKV(v8.3), large transactions can block resolve-ts from advancing, b
 ## Goals
 
-Do not let **large pipelined transactions** block the advance of resolved-ts.
+In the current phase, our primary goal is to not let **large pipelined transactions** block the advance of resolved-ts. We focus on large pipelined transactions here. It could be adapted for general "large" transactions.
 
-We focus on large pipelined transactions here. It could be adapted for general "large" transactions.
+Our ultimate goal is to achieve unblocked resolved-ts progression. Besides long transactions and their locks, there are other factors that can block the advance of resolved-ts. We will discuss them in the last part of the proposal.
 
 ## Assumptions
 
@@ -28,11 +28,11 @@ This constraint is not a strict limit, but rather serves to manage resource util
 
 The key idea is using `lock.min_commit_ts` to calculate resolved-ts instead of `lock.start_ts`.
 
-A resolved-ts guarantees that all historical events prior to this timestamp are finalized and observable.
'Historical events' in this context refer specifically to write records and rollback records, but explicitly exclude locks. It's important to note that the absence of locks with earlier timestamps is not a requirement for a valid resolved-ts, as long as the status of their corresponding transactions is definitively determined.
+A resolved timestamp (resolved-ts) ensures that all historical events before this point are finalized and observable. In this context, 'historical events' specifically mean write and rollback records, excluding locks in the LOCK CF. Importantly, a valid resolved-ts doesn't require the absence of earlier locks, as long as their transactions' status is determined.
 
 ### Maintenance of resolved-ts
 
-Key objective: Maximize TiKV nodes' awareness of large pipelined transactions, including:
+Key objective: Maximize all TiKV nodes' awareness of large pipelined transactions during their lifetime, i.e. from their first writes to all locks being committed. The following info is necessary:
 
 1. start_ts
 2. Recent min_commit_ts
 3. Status
@@ -44,6 +44,12 @@ For a large pipelined transaction, its TTL manager is responsible for fetching a
 Atomic variables or locks may be needed for synchronization between the TTL manager and the committer.
 
+#### Scaling out TiKVs
+
+When a new TiKV instance is added to the cluster in the middle of a large transaction, the transaction's TTL manager must broadcast to it in time. The TTL manager gets the list of stores from the region cache. If the region cache is unaware of a newly started TiKV, the TTL manager may miss it.
+
+To mitigate this, we propose implementing an optional routine in the region cache to periodically fetch all stores.
+
 #### TiKV scheduler - heartbeat
 
 Besides updating TTL, it can also update min_commit_ts of the PK.
 
@@ -69,13 +75,19 @@ After the successfully commiting all secondary locks of a large transaction, the
 The resolver tracks normal locks as usual, but handles locks belonging to large pipelined transactions in a different way.
The locks can be identified via the "generation" field.
 
-For a lock belonging to a large pipelined transaction, the resolver only tracks its start_ts. When calculating resolved-ts, the resolver first tries to map the start_ts to its min_commit_ts by querying the txn_status_cache. If it is not found in the cache, it falls back to calculating with start_ts.
+For locks in large pipelined transactions, the resolver only tracks the start_ts. When calculating resolved-ts, it first attempts to map start_ts to min_commit_ts via the txn_status_cache. To maintain semantics, resolved-ts must be at least min_commit_ts + 1. If the cache lookup fails, it falls back to using start_ts for calculation.
 
 Upon observing a LOCK DELETION, the resolver ceases tracking the corresponding start_ts for large pipelined transactions. This is justified as lock deletion only occurs once a transaction's final state is determined.
 
+### Benefits in resolving locks
+
+Across all lock resolution scenarios—including normal reads, stale reads, flashbacks, and potentially write conflicts—a preliminary txn_status_cache lookup can significantly reduce unnecessary computational overhead introduced by large transactions.
+
 ### Compatibility
 
-The key difference is that services can now observe locks. They need to handle the locks.
+The key difference is that services can now observe many more locks.
+
+Note that the current implementation still allows encountering locks with timestamps smaller than the resolved timestamp. This proposal doesn't change this behavior, so we don't anticipate correctness issues with this change. The main challenges will be related to performance and availability.
 
 #### Stale read
 
@@ -83,12 +95,16 @@ When it meets a lock, first query the txn_status_cache. When not found in the ca
 
 #### Flashback
 
-*TBD*
+1. Compatibility with CDC: Flashback will write a lock to block resolved-ts during its execution. It does not use pipelined transactions, so this lock will be treated as a normal lock.
+
+2. 
The current and previous (up to v8.3) implementations of Flashback in TiKV rely on an incorrect assumption about resolved-ts guarantees. This misconception can lead to critical issues, such as the potential violation of transaction atomicity, as documented in https://github.com/tikv/tikv/issues/17415.
 
 #### EBS snapshot backups
 
 *TBD*
 
+It depends on Flashback.
+
 #### CDC
 
 Already well documented in [Large Transactions Don't Block Watermark](https://github.com/pingcap/tiflow/blob/master/docs/design/2024-01-22-ticdc-large-txn-not-block-wm.md). Briefly, refactoring work is needed.
 
@@ -103,3 +119,50 @@ RPCs: each large transaction sends N more RPCs per second, where N is the number
 
 CPU: the mechanism may consume more CPU, but the overhead should be negligible.
+
+
+## Possible future improvements
+
+#### Tolerate lagging non-pipelined transactions
+
+To get closer to our ultimate goal of minimizing blocking of resolved-ts, we can further consider the cases where resolved-ts is blocked by normal transaction locks. Typical causes could be:
+
+- Memory locks from async commit and 1PC. Normal locks are region-partitioned and will not block resolved-ts of other regions. But the concurrency manager is a node-level instance. Memory locks can block every (leader) region in the same TiKV.
+- Slow transactions which take too much time committing their locks
+- Long-running transactions that may not be large.
+- Node failures
+
+Resolved-ts must continuously progress. However, it can't advance autonomously while ignoring locks. Such advancement would require the commit PK operation to either complete before the resolved-ts reaches a certain point or fail. This guarantee is not feasible.
+
+The remaining feasible approach to prevent resolved-ts from being blocked by normal transactions is to actively push their min_commit_ts, similar to what is done for large transactions.
+
+However, locks using async commit cannot be pushed.
+
+To sum up, when a resolver meets a lock whose min_commit_ts still blocks the resolved-ts candidate:
+
+- Check the cache
+  - Found: if T.min_commit_ts >= R_TS candidate -> skip the lock
+  - Else, fall through
+
+- 2PC locks: call check_txn_status and try to push the min_commit_ts.
+  - Committed -> return its commit_ts
+    - Commit_ts > R_TS candidate -> skip the lock
+    - Commit_ts < R_TS candidate -> block at commit_ts - 1.
+  - Min commit ts pushed, or min_commit_ts > R_TS candidate -> skip the lock
+  - Rolled back -> skip the lock
+  - Else -> block at min_commit_ts - 1
+- Async commit locks -> check their status
+  - Committed -> same as 2PC locks
+  - Rolled back -> skip the lock
+  - Else if min_commit_ts > R_TS candidate -> skip the lock
+  - Else -> block at min_commit_ts + 1
+
+Locks belonging to the same transaction can be consolidated.
+
+To mitigate uncontrollable overhead and metastability risks, we limit our check to K transactions per region with the lowest min_commit_ts values. This approach is necessary given the potentially substantial total number of transactions.
+
+#### Reduce write-read conflicts
+
+Read requests typically require a check_txn_status to advance the min_commit_ts. We propose allowing large transactions to set their min_commit_ts to a higher value, potentially exceeding the current TSO. These min_commit_ts values, stored in the txn_status_cache, would enable read requests encountering locks to bypass them via a cache lookup. Large transactions would cease this special min_commit_ts setting once ready for prewrite.
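The per-lock decision procedure above can be sketched as follows (a hypothetical simplification: `check_txn_status` is modeled as a plain callback returning a status dict, and the return value is the bound this one lock imposes on the resolved-ts candidate `r_ts`; the real RPC and its return shape differ):

```python
def resolve_one_lock(start_ts, r_ts, cache, check_txn_status):
    """Return the resolved-ts value this lock permits (r_ts = candidate)."""
    # 1. Cache fast path: a known min_commit_ts at or beyond the candidate
    #    means the transaction cannot commit below it -> skip the lock.
    entry = cache.get(start_ts)
    if entry is not None and entry["min_commit_ts"] >= r_ts:
        return r_ts

    # 2. Fall back to check_txn_status, trying to push min_commit_ts
    #    past the candidate.
    status = check_txn_status(start_ts, push_to=r_ts + 1)
    if status["kind"] == "committed":
        # Skip if committed beyond the candidate; otherwise block just
        # below the actual commit_ts.
        return r_ts if status["commit_ts"] > r_ts else status["commit_ts"] - 1
    if status["kind"] == "rolled_back":
        return r_ts
    # Still ongoing: skip if min_commit_ts was pushed past the candidate,
    # otherwise block just below it.
    if status["min_commit_ts"] > r_ts:
        return r_ts
    return status["min_commit_ts"] - 1

# Example: a transaction committed below the candidate caps the result.
def fake_check(start_ts, push_to):
    return {"kind": "committed", "commit_ts": 120}

print(resolve_one_lock(1, 150, {}, fake_check))  # 119
```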
From 813b856f4222cfc0d38b03705d70ea557e877a7e Mon Sep 17 00:00:00 2001 From: ekexium Date: Tue, 27 Aug 2024 19:21:22 +0800 Subject: [PATCH 3/8] refine Signed-off-by: ekexium --- ...0114-resolved-ts-for-large-transactions.md | 123 ++++++++++++------ 1 file changed, 80 insertions(+), 43 deletions(-) diff --git a/text/0114-resolved-ts-for-large-transactions.md b/text/0114-resolved-ts-for-large-transactions.md index 66499806..8c5799ac 100644 --- a/text/0114-resolved-ts-for-large-transactions.md +++ b/text/0114-resolved-ts-for-large-transactions.md @@ -6,53 +6,61 @@ Tracking issue: N/A ## Background -The RFC is a variation of @zhangjinpeng87 's [Large Transactions Don't Block Watermark](https://github.com/pingcap/tiflow/blob/master/docs/design/2024-01-22-ticdc-large-txn-not-block-wm.md). They aim to solve the same problem. +The RFC is a variation and extension of @zhangjinpeng87 's [Large Transactions Don't Block Watermark](https://github.com/pingcap/tiflow/blob/master/docs/design/2024-01-22-ticdc-large-txn-not-block-wm.md). They aim to solve the same problem. -Resolved-ts is a tool for other services. It's definition is that no new commit records smaller than the resolved-ts will be observed after you observe the resolved-ts. +Resolved-ts is a mechanism that provides a temporal guarantee. It is defined as follows: Once a resolved timestamp T is observed by any component of the system, it is guaranteed that no new commit record with a commit timestamp less than or equal to T will be subsequently produced or observed by any component of the system. In current TiKV(v8.3), large transactions can block resolve-ts from advancing, because it is calculated as `min(pd-tso, min(lock.ts))`, which is actually a more stringent constraint than its original definition. A lock from a pipelined txn can live several hours. This will make services dependent on resolved-ts unavailable. 
## Goals
 
-In the current phase, our primary goal is to not let **large pipelined transactions** block the advance of resolved-ts. We focus on large pipelined transactions here. It could be adapted for general "large" transactions.
+- Current phase objectives:
+  - Prevent pipelined transactions from impeding resolved-ts advancement.
 
-Our ultimate goal is to achieve unblocked resolved-ts progression. Besides long transactions and their locks, there are other factors that can block the advance of resolved-ts. We will discuss them in the last part of the proposal.
-
-## Assumptions
-
-We assume that the number of concurrent pipelined transactions is bounded, not exceeding 10000, for example.
-
-This constraint is not a strict limit; rather, it serves to manage resource utilization and facilitate performance evaluation. 10000 should be large enough in the real world.
+- Long-term goal:
+  - Minimize the overhead of resolved-ts maintenance.
+  - Maximize resolved-ts freshness.
+  - Achieve uninterrupted resolved-ts progression, addressing all potential blocking factors beyond long transactions and their associated locks. (Further details to be discussed in the final section of this proposal.)
 
 ## Design
 
-The key idea is using `lock.min_commit_ts` to calculate resolved-ts instead of `lock.start_ts`.
+Key Concept: Utilizing `lock.min_commit_ts` instead of `lock.start_ts` for resolved-ts calculation.
+
+Rationale:
+1. Resolved-ts definition is independent of LOCK CF.
+2. Current use of `lock.start_ts` is based on the invariant: `lock.start_ts` < `lock.commit_ts`.
+3. A valid resolved-ts doesn't necessitate the absence of locks with smaller start_ts, provided their future commit_ts are guaranteed to be larger.
 
-A resolved timestamp (resolved-ts) ensures that all historical events before this point are finalized and observable. In this context, 'historical events' specifically mean write and rollback records, excluding locks in the LOCK CF.
Importantly, a valid resolved-ts doesn't require the absence of earlier locks, as long as their transactions' status is determined.
+Advantages of `lock.min_commit_ts`:
+1. Satisfies a similar invariant: `lock.min_commit_ts` <= `lock.commit_ts`
+2. Unlike the static `lock.start_ts`, `lock.min_commit_ts` can be dynamically increased.
 
 ### Maintenance of resolved-ts
 
 Key objective: Maximize all TiKV nodes' awareness of large pipelined transactions during their lifetime, i.e. from their first writes to all locks being committed. The following info is necessary:
 
-1. start_ts
-2. Recent min_commit_ts
+1. `start_ts`
+2. A fresh enough `min_commit_ts`
 3. Status
 
 #### Coordinator
 
-For a large pipelined transaction, its TTL manager is responsible for fetching a latest TSO as a candidate of min_commit_ts and updating both the committer's inner state and the PK in TiKV. After that, it broadcasts the start_ts and the new min_commit_ts to all TiKV stores. The update of the PK can be done within the heartbeat request.
+For a pipelined transaction, its TTL manager is responsible for fetching a latest TSO as a candidate of min_commit_ts and updating both the committer's inner state and the PK in TiKV. After that, it broadcasts the `start_ts` and the new `min_commit_ts` to all TiKV stores. The PK update can be piggybacked on the `heartbeat` request.
+
+Optionally, to avoid too much RPC overhead, the broadcast messages from different transactions can be batched.
 
 Atomic variables or locks may be needed for synchronization between the TTL manager and the committer.
 
 #### Scaling out TiKVs
 
-When a new TiKV instance is added to the cluster in the middle of a large transaction, the transaction's TTL manager must broadcast to it in time. The TTL manager gets the list of stores from the region cache. If the region cache is unaware of a newly started TiKV, the TTL manager may miss it.
+Challenge:
+During cluster expansion, when a new TiKV instance is integrated mid-transaction, the TTL manager must promptly incorporate it into its broadcast list.
The TTL manager relies on the region cache for store information. However, if the region cache lacks awareness of a newly added TiKV, the TTL manager may inadvertently omit it from broadcasts.
 
-To mitigate this, we propose implementing an optional routine in the region cache to periodically fetch all stores.
+One solution is to have a background goroutine in the region cache periodically refresh the complete store list.
 
 #### TiKV scheduler - heartbeat
 
-Besides updating TTL, it can also update min_commit_ts of the PK.
+Besides updating TTL, it also supports updating min_commit_ts of the PK.
 
 *TBD: should it also update max_ts?*
 
@@ -60,24 +68,55 @@ A standalone part was created for large transactions specially. The cache serves as
 
-1. A fresh enough source of min_commit_ts info of large transactions for resolved-ts resolver
-2. A fast path for read requests when they would otherwise return to coordinator to check PK's min_commit_ts.
+1. Provides up-to-date `min_commit_ts` information for large transactions to the resolved-ts resolver.
+2. Offers an optimized path for read requests, reducing the need to query the PK for transaction status.
+
+Cache Management Strategy:
+
+1. Retention policy:
+   - Maximize retention of useful information.
+   - No eviction based on space constraints, leveraging the compact entry structure.
+   - Assumption: Limited number of concurrent large transactions.
+
+2. TTL management:
+   - Implement a substantial default TTL for cache entries.
+   - Rationale: Minimize redundant operations when readers encounter locks from these transactions.
 
-##### Eviction
+3. Post-commit procedure:
+   - Upon successful commitment of all secondary locks in a large transaction:
+     a. Coordinator broadcasts a TTL update to all TiKV nodes.
+     b. Extends TTL by several seconds.
+   - Purpose: Allow follower peers time to catch up with the leader.
+ - Caution: Immediate eviction may lead to stale reads encountering locks and missing the cache. -We would keep as much useful info as possible in the cache, and never evict any of them because of space issue. One entry only contains information like start_ts + min_commit_ts + status + TTL, which should make the cache small enough, considering our assumption of the number of ongoing large transactions. +#### TiKV resolved-ts Resolver -There should be a large defaut TTL of these entries, because we want to save unnecessary efforts when some reader meets a lock belonging to these transactions. +Operational Mechanism: -After the successfully commiting all secondary locks of a large transaction, the coordinator explicitly broadcasts a TTL update to all TiKV nodes, extending it to several seconds later. Don't immediately evict the entry to give the follower peers some time to catch up with leader, otherwise a stale read may encounter a lock and miss the cache. +1. Standard lock handling: + - Tracks normal locks using conventional methods. -#### TiKV resolved-ts resolver +2. Large pipelined transaction Locks: + - Identified by the "generation" field. + - Tracks only the start_ts of locks. -Resolver tracks normal locks as usual, but handles locks belonging to large pipelined transactions in a different way. The locks can be identified via the "generation" field. +Resolved-ts calculation: -For locks in large pipelined transactions, the resolver only tracks the start_ts. When calculating resolved-ts, it first attempts to map start_ts to min_commit_ts via the txn_status_cache. To maintain semantics, resolved-ts must be at least min_commit_ts + 1. If the cache lookup fails, it falls back to using start_ts for calculation. +- Primary Method: + - Attempts to map start_ts to min_commit_ts via txn_status_cache. + - Sets resolved-ts to max(min_commit_ts + 1, current_resolved_ts). +- Fallback Method: + - Uses start_ts for calculation if cache lookup fails. 
-Upon observing a LOCK DELETION, the resolver ceases tracking the corresponding start_ts for large pipelined transactions. This is justified as lock deletion only occurs once a transaction's final state is determined. + + +We preseve the resolved-ts semantics by ensuring that resolved-ts is always greater than or equal to min_commit_ts + 1. + +When the resolver observes a LOCK DELETION event, it immediately ceases tracking the corresponding start_ts for large pipelined transactions. This action is justified because lock deletion is a clear indicator that a transaction's final state has been determined. By stopping the tracking at this point, the resolver efficiently manages its resources and maintains an up-to-date view of active transactions. + +### Upgrading TiKV + +This design constitutes a non-intrusive modification, eliminating specific concerns during the upgrade process. In case of a cache miss, the system automatically falls back to the original approach, ensuring seamless backward compatibility. ### Benefits in resolving locks @@ -85,9 +124,9 @@ Across all lock resolution scenarios—including normal reads, stale reads, flas ### Compatibility -The key difference is that services can now observe much more locks. +The key difference is that services can now observe more locks. -Note that the current implementation still allows encountering locks with timestamps smaller than the resolved timestamp. This proposal doesn't change this behavior, so we don't anticipate correctness issues with this change. The main challenges will be related to performance and availability. +It's important to note that the current implementation still permits the encounter of locks with timestamps smaller than resolved-ts. This proposal maintains this existing behavior, thus we do not anticipate any correctness issues arising from this modification. The principal challenges we foresee are mainly performance and availability concerns. 
#### Stale read @@ -101,9 +140,7 @@ When it meets a lock, first query the txn_status_cache. When not found in the ca #### EBS snapshot backups -*TBD* - -It depends on Flashback. +Its only dependency on resolved-ts is to use Flashback. #### CDC @@ -113,9 +150,9 @@ Already well documented in [Large Transactions Don't Block Watermark](https://gi Memory: each cache entry takes at least 8(start_ts) + 8(min_commit_ts) + 1(status) + 8(TTL) = 33 bytes. Any TiKV instance can easily hold millions of such entries. -Latency: maintenance of resolved-ts requires extra work, but they can be asynchoronous, thus not affecting latency. +Latency: The additional operations required for resolved-ts maintenance can be executed asynchronously, thereby mitigating any potential impact on system latency. -RPCs: each large transaction sends N more RPCs per second, where N is the number of TiKVs. +RPCs: each large transaction sends N more RPCs per second, where N is the number of TiKVs. Batching can greatly reduce the RPC overhead. CPU: the mechanism may consume more CPU, but should be ignorable. @@ -123,16 +160,16 @@ CPU: the mechanism may consume more CPU, but should be ignorable. ## Possible future improvements -#### Tolerate lagging non-pipelined transactions +### Tolerate lagging non-pipelined transactions To get closer to our ultimate goal: minimize blocking of resolved-ts, we can further consider the case where resolved-ts being blocked by normal transaction locks. Typical causes could be: -- Memory locks from async commit and 1PC. Normal locks are region-partitioned can will not block resolved-ts of other regions. But concurrenty manager is a node-level instance. Memory locks can block every (leader) region in the same TiKV. -- Slow transactions which take too much time committing their locks -- Long-running transactions that may not be large. -- Node failures - +- Situation-1: Memory locks from async commit and 1PC. 
  Normal locks are region-partitioned and will not block resolved-ts of other regions. But the concurrency manager is a node-level instance. Memory locks can block every (leader) region in the same TiKV.
- Situation-2: Slow transactions which take too much time committing their locks
- Situation-3: Long-running transactions that may not be large.
- Situation-4: Node failures, network jitters, etc.
 
+#### Approach-1: resolver pushing min_commit_ts
 
 Resolved-ts must continuously progress. However, it can't advance autonomously while ignoring locks. Such advancement would require the commit PK operation to either complete before the resolved-ts reaches a certain point or fail. This guarantee is not feasible.
 
@@ -163,6 +200,6 @@ Locks belonging to the same transaction can be consolidated.
 
 To mitigate uncontrollable overhead and metastability risks, we limit our check to K transactions per region with the lowest min_commit_ts values. This approach is necessary given the potentially substantial total number of transactions.
 
-#### Reduce write-read conflicts
+#### Approach-2: long-running transactions setting min_commit_ts
 
-Read requests typically require a check_txn_status to advance the min_commit_ts. We propose allowing large transactions to set their min_commit_ts to a higher value, potentially exceeding the current TSO. These min_commit_ts values, stored in the txn_status_cache, would enable read requests encountering locks to bypass them via a cache lookup. Large transactions would cease this special min_commit_ts setting once ready for prewrite.
+If a transaction has already been running for a long time, it must get a latest TSO as its min_commit_ts before it starts prewriting, if it's not using async commit or 1PC. This prevents short-lived locks from blocking resolved-ts, whether they are memory locks or not.
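As a rough model of the resolver-side calculation introduced in this patch (illustrative Python, not the Rust implementation; names are hypothetical): large-transaction locks contribute their cached min_commit_ts instead of their start_ts, and a cache miss safely falls back to start_ts, which is always a valid bound because start_ts < commit_ts.

```python
def resolved_ts_candidate(pd_tso, normal_lock_ts, large_txn_start_ts, cache):
    # Normal locks contribute their start_ts, as today.
    bounds = [pd_tso] + list(normal_lock_ts)
    for start_ts in large_txn_start_ts:
        # cache maps start_ts -> freshest known min_commit_ts; a miss falls
        # back to start_ts, which is safe but pessimistic.
        bounds.append(cache.get(start_ts, start_ts))
    return min(bounds)

# A large txn started at ts=100, but its min_commit_ts has been pushed to
# 4,900,000, so resolved-ts is no longer pinned near 100.
print(resolved_ts_candidate(5_000_000, [4_999_000], [100], {100: 4_900_000}))  # 4900000
```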
From f03244249108ff02aad8418943aec38a9f1f5e71 Mon Sep 17 00:00:00 2001 From: ekexium Date: Tue, 3 Sep 2024 23:56:37 +0800 Subject: [PATCH 4/8] format: reorganize the doc; focus more on proposed changes; remove some discussions Signed-off-by: ekexium --- ...0114-resolved-ts-for-large-transactions.md | 219 +++++++----------- 1 file changed, 89 insertions(+), 130 deletions(-) diff --git a/text/0114-resolved-ts-for-large-transactions.md b/text/0114-resolved-ts-for-large-transactions.md index 8c5799ac..2c4496ef 100644 --- a/text/0114-resolved-ts-for-large-transactions.md +++ b/text/0114-resolved-ts-for-large-transactions.md @@ -2,17 +2,17 @@ Author: @ekexium -Tracking issue: N/A +Tracking issue: https://github.com/tikv/tikv/issues/17459 -## Background +## 1. Background -The RFC is a variation and extension of @zhangjinpeng87 's [Large Transactions Don't Block Watermark](https://github.com/pingcap/tiflow/blob/master/docs/design/2024-01-22-ticdc-large-txn-not-block-wm.md). They aim to solve the same problem. +This RFC is a variation and extension of @zhangjinpeng87's [Large Transactions Don't Block Watermark](https://github.com/pingcap/tiflow/blob/master/docs/design/2024-01-22-ticdc-large-txn-not-block-wm.md). They aim to solve the same problem. -Resolved-ts is a mechanism that provides a temporal guarantee. It is defined as follows: Once a resolved timestamp T is observed by any component of the system, it is guaranteed that no new commit record with a commit timestamp less than or equal to T will be subsequently produced or observed by any component of the system. +Resolved-ts is a mechanism that provides a temporal guarantee. It is defined as follows: Once a resolved-ts T is observed by any component of the system, it is guaranteed that no new commit record with a commit-ts less than or equal to T will be subsequently produced or observed by any component of the system. 
In current TiKV (v8.3), large transactions can block resolved-ts from advancing, because it is calculated as `min(pd-tso, min(lock.ts))`, which is actually a more stringent constraint than its original definition. A lock from a pipelined txn can live several hours. This will make services dependent on resolved-ts unavailable. -## Goals +## 2. Goals - Current phase objectives: - Prevent pipelined transactions from impeding resolved-ts advancement. @@ -20,186 +20,145 @@ In current TiKV (v8.3), large transactions can block resolved-ts from advancing, b - Long-term goal: - Minimize the overhead of resolved-ts maintenance. - Maximize resolved-ts freshness. - - Achieve uninterrupted resolved-ts progression, addressing all potential blocking factors beyond long transactions and their associated locks. (Further details to be discussed in the final section of this proposal.) + - Achieve uninterrupted resolved-ts progression, addressing all potential blocking factors beyond large transactions. -## Design +## 3. Proposed Design -Key Concept: Utilizing `lock.min_commit_ts` instead of `lock.start_ts` for resolved-ts calculation. +### 3.1 Key Idea + +**Proposed Change:** Utilize `lock.min_commit_ts` instead of `lock.start_ts` for resolved-ts calculation. Rationale: 1. Resolved-ts definition is independent of LOCK CF. 2. Current use of `lock.start_ts` is based on the invariant: `lock.start_ts` < `lock.commit_ts`. -3. A valid resolved-ts doesn't necessitate the absence of locks with smaller start_ts, provided their future commit_ts are guaranteed to be larger. +3. A valid resolved-ts doesn't necessitate the absence of locks with smaller start_ts, as long as their future commit_ts are guaranteed to be larger. Advantages of `lock.min_commit_ts`: -1. Satisfies a similar invariant: `lock.min_commit_ts` <= `lock.commit_ts` -2. Unlike the static `lock.start_ts`, `lock.min_commit_ts` can be dynamically increased. 
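To make the key idea concrete, here is a minimal Python model of the two calculations (the function names and the dict representation of locks are hypothetical illustrations, not TiKV code):

```python
# Toy model of the two resolved-ts formulas (hypothetical names, not
# actual TiKV code). Each lock is a dict with the fields discussed above.

def resolved_ts_current(pd_tso, locks):
    # Current rule: min(pd-tso, min(lock.start_ts)). A long-lived lock
    # from a pipelined txn pins resolved-ts at its start_ts.
    return min([pd_tso] + [lock["start_ts"] for lock in locks])

def resolved_ts_proposed(pd_tso, locks):
    # Proposed rule: substitute lock.min_commit_ts, valid because
    # min_commit_ts <= the lock's eventual commit_ts.
    return min([pd_tso] + [lock["min_commit_ts"] for lock in locks])

locks = [
    {"start_ts": 100, "min_commit_ts": 950},  # large pipelined txn, pushed
    {"start_ts": 900, "min_commit_ts": 901},  # ordinary short txn
]
print(resolved_ts_current(1000, locks))   # 100: pinned by the large txn
print(resolved_ts_proposed(1000, locks))  # 901: resolved-ts keeps advancing
```

The large transaction's lock no longer matters once its `min_commit_ts` is periodically pushed forward, which is exactly the property the rest of the design maintains.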
- -### Maintanence of resolved-ts - -Key objective: Maximize all TiKV nodes' awareness of large pipelined transactions during their lifetime, i.e. from their first writes to all locks being committed. These info are necessary: - -1. `start_ts` -2. A fresh enough `min_commit_ts` -3. Status - -#### Coordinator - -For a pipelined transaction, its TTL manager is responsible for fetching a latest TSO as a candidate of min_commit_ts and update both the committer's inner state and PK in TiKV. After that, it broadcasts the `start_ts` and the new `min_commit_ts` to all TiKV stores. The PK update can be piggybacked on the `heartbeat` request. +1. Satisfies a similar invariant: `lock.min_commit_ts` <= its `commit_ts` +2. Unlike the constant `lock.start_ts`, `lock.min_commit_ts` can be dynamically increased. -Optionally, to avoid too many RPC overhead, the broadcast messages from different transactions can be batched. +### 3.2 Maintenance of resolved-ts -Atomic variables or locks may be needed for synchronization between the TTL manager and the committer. +Key objective: Maximize all TiKV nodes' awareness of large pipelined transactions during their lifetime. -#### Scaling out TiKVs +#### 3.2.1 Coordinator -Challenge: -During cluster expansion, when a new TiKV instance is integrated mid-transaction, the TTL manager must promptly incorporate it into its broadcast list. The TTL manager relies on the region cache for store information. However, if the region cache lacks awareness of a newly added TiKV, the TTL manager may inadvertently omit it from broadcasts. +**Proposed Change:** +- TTL manager fetches latest TSO as min_commit_ts candidate. +- Updates committer's inner state and PK in TiKV. +- Broadcasts `start_ts` and new `min_commit_ts` to all TiKV stores. +- PK update piggybacked on `heartbeat` request. +- Optional: Batch broadcast messages to reduce RPC overhead. 
-One solution is to have an background goroutine in the region cache to periodically refresh the complete store list. +Atomic variables or locks may be needed for synchronization between TTL manager and committer. -#### TiKV scheduler - heartbeat +#### 3.2.2 Scaling out TiKVs -Besides updating TTL, it also supports update min_commit_ts of the PK. +**Challenge:** New TiKV instances integrated mid-transaction may be omitted from broadcasts if region cache is unaware. -*TBD: should it also update max_ts?* +**Proposed Solution:** Implement a background goroutine in the region cache to periodically refresh the full store list. -#### TiKV txn_status_cache +An alternative solution is to let PD push the full store list to TiDBs when the cluster topology changes. However, there is no existing mechanism for this, and we don't want to introduce a new one, as it may be too complex. -A standalone part was created for large transactions specially. The cache serves as +#### 3.2.3 TiKV scheduler - heartbeat -1. Provides up-to-date `min_commit_ts` information for large transactions to the resolved-ts resolver. -1. Offers an optimized path for read requests, reducing the need to query PK for transaction status. +**Proposed Change:** Extend heartbeat functionality to support updating min_commit_ts of the PK. -Cache Management Strategy: +*TBD: Consider whether it should also update max_ts.* -1. Retention policy: - - Maximize retention of useful information. - - No eviction based on space constraints, leveraging the compact entry structure. - - Assumption: Limited number of concurrent large transactions. +#### 3.2.4 TiKV txn_status_cache -2. TTL management: - - Implement a substantial default TTL for cache entries. - - Rationale: Minimize redundant operations when readers encounter locks from these transactions. +**Proposed Change:** Introduce a standalone part next to the existing cache for large transactions that: +1. Provides `min_commit_ts` information for resolved-ts resolver. +2. 
Offers optimized path for read requests, reducing PK queries for transaction status. -3. Post-commit procedure: - - Upon successful commitment of all secondary locks in a large transaction: a. Coordinator broadcasts a TTL update to all TiKV nodes. b. Extends TTL by several seconds. - - Purpose: Allow follower peers time to synchronize catch up with the leader. - - Caution: Immediate eviction may lead to stale reads encountering locks and missing the cache. +Given the condition that given the condition that the number of concurrent pipelined transactions is limited, we want the cache to satisfy the following requirements: + - Maximize retention of useful information. + - Entries should be discarded based on coordinator directives, not due to TTL expiration or cache capacity limits. + - Maximize hit rate, i.e. the eviction of a transaction entry should happen after lock release, especially for follower peers. -#### TiKV resolved-ts Resolver +**Proposed Cache Implementations:** +1. A large enough capacity to minimize full cache evictions. +2. Compact cache entries to minimize memory overhead. +3. A large enough default TTL. +4. Post-commit procedure after committing all secondary locks: + a. Coordinator broadcasts TTL update to all TiKV nodes. + b. Extends TTL by several seconds. -Operational Mechanism: +#### 3.2.5 TiKV resolved-ts Resolver -1. Standard lock handling: - - Tracks normal locks using conventional methods. - -2. Large pipelined transaction Locks: - - Identified by the "generation" field. - - Tracks only the start_ts of locks. - -Resolved-ts calculation: +**Proposed Operational Mechanism:** +1. Standard lock handling remains unchanged. +2. For large pipelined transaction locks: + - Identify by "generation" field. + - Track only `start_ts` of locks. +3. Stop tracking `start_ts` for large pipelined transactions upon LOCK DELETION event. 
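The resolver's cache lookup with its fallback can be sketched as follows (hypothetical structures; the actual TiKV resolver differs, and the RFC's precise rule additionally takes the max with the current resolved-ts):

```python
# Sketch of the resolver consulting txn_status_cache before falling back
# to start_ts (hypothetical structures, not the actual TiKV resolver).
# txn_status_cache maps start_ts -> latest known min_commit_ts.

def lock_bound(lock, txn_status_cache):
    min_commit_ts = txn_status_cache.get(lock["start_ts"])
    if min_commit_ts is not None:
        # Primary method: a large txn's lock only constrains resolved-ts
        # by its (pushed) min_commit_ts, not by its old start_ts.
        return min_commit_ts
    # Fallback method: cache miss, behave exactly as today.
    return lock["start_ts"]

def resolved_ts_candidate(pd_tso, tracked_locks, txn_status_cache):
    return min([pd_tso] + [lock_bound(lk, txn_status_cache) for lk in tracked_locks])

cache = {100: 950}  # large pipelined txn whose min_commit_ts was pushed
locks = [{"start_ts": 100}, {"start_ts": 900}]
print(resolved_ts_candidate(1000, locks, cache))  # 900: the ordinary lock governs
```

The fallback path is what makes the change backward compatible: on a cache miss the candidate degrades to exactly the current behavior.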
+**Proposed Resolved-ts calculation:** - Primary Method: - - Attempts to map start_ts to min_commit_ts via txn_status_cache. - - Sets resolved-ts to max(min_commit_ts + 1, current_resolved_ts). + - Map `start_ts` to `min_commit_ts` via txn_status_cache. + - Set resolved-ts to `max(min_commit_ts + 1, current_resolved_ts)`. - Fallback Method: - - Uses start_ts for calculation if cache lookup fails. - - - -We preseve the resolved-ts semantics by ensuring that resolved-ts is always greater than or equal to min_commit_ts + 1. - -When the resolver observes a LOCK DELETION event, it immediately ceases tracking the corresponding start_ts for large pipelined transactions. This action is justified because lock deletion is a clear indicator that a transaction's final state has been determined. By stopping the tracking at this point, the resolver efficiently manages its resources and maintains an up-to-date view of active transactions. + - Use start_ts if cache lookup fails. -### Upgrading TiKV +### 3.3 Upgrading TiKV -This design constitutes a non-intrusive modification, eliminating specific concerns during the upgrade process. In case of a cache miss, the system automatically falls back to the original approach, ensuring seamless backward compatibility. +This proposal is non-intrusive, with automatic fallback to original approach on cache miss, ensuring backward compatibility. -### Benefits in resolving locks +### 3.4 Benefits in resolving locks -Across all lock resolution scenarios—including normal reads, stale reads, flashbacks, and potentially write conflicts—a preliminary txn_status_cache lookup can significantly reduce unnecessary computational overhead introduced by large transactions. +Preliminary txn_status_cache lookup can reduce unnecessary overhead for various lock resolution scenarios. -### Compatibility +## 4. Compatibility Considerations -The key difference is that services can now observe more locks. +Key difference: Services may observe more locks. 
-It's important to note that the current implementation still permits the encounter of locks with timestamps smaller than resolved-ts. This proposal maintains this existing behavior, thus we do not anticipate any correctness issues arising from this modification. The principal challenges we foresee are mainly performance and availability concerns. +**Note:** Current implementation still permits the encounter of locks with timestamps smaller than resolved-ts. This proposal maintains this existing behavior, thus we do not anticipate any correctness issues arising from this modification. The principal challenges we foresee are mainly performance and availability concerns. -#### Stale read +### 4.1 Stale read -When it meets a lock, first query the txn_status_cache. When not found in the cache, fallback to leader read. +**Proposed Approach:** Query txn_status_cache first when encountering a lock. Fallback to leader read if not found in cache. -#### Flashback - -1. Compatilibity with CDC: Flashback will write a lock to block resolved-ts during its execution. It does not use pipelined transaction so this lock will be treated as a normal lock. +### 4.2 Flashback +1. Compatibility with CDC: Flashback will write a lock to block resolved-ts during its execution. It does not use pipelined transaction so the behavior is not affected. 2. The current and previous (up to v8.3) implementations of Flashback in TiKV rely on an incorrect assumption about resolved-ts guarantees. This misconception can lead to critical issues, such as the potential violation of transaction atomicity, as documented in https://github.com/tikv/tikv/issues/17415. -#### EBS snapshot backups - -Its only dependency on resolved-ts is to use Flashback. - -#### CDC - -Already well documented in [Large Transactions Don't Block Watermark](https://github.com/pingcap/tiflow/blob/master/docs/design/2024-01-22-ticdc-large-txn-not-block-wm.md). Briefly, a refactoring work is needed. 
- -### Cost - -Memory: each cache entry takes at least 8(start_ts) + 8(min_commit_ts) + 1(status) + 8(TTL) = 33 bytes. Any TiKV instance can easily hold millions of such entries. +### 4.3 EBS snapshot backups -Latency: The additional operations required for resolved-ts maintenance can be executed asynchronously, thereby mitigating any potential impact on system latency. +No direct impact. Only dependency on resolved-ts is through Flashback. -RPCs: each large transaction sends N more RPCs per second, where N is the number of TiKVs. Batching can greatly reduce the RPC overhead. +### 4.4 CDC -CPU: the mechanism may consume more CPU, but should be ignorable. +Refactoring work needed, as documented in [Large Transactions Don't Block Watermark](https://github.com/pingcap/tiflow/blob/master/docs/design/2024-01-22-ticdc-large-txn-not-block-wm.md). +## 5. Cost Analysis +- Memory: ~33 bytes per cache entry. TiKV instances can hold millions of entries. +- Latency: Minimal impact due to asynchronous execution of additional operations. +- RPCs: Increased RPC count, could be mitigated by batching. +- CPU: Slight increase, expected to be negligible. -## Possible future improvements +## 6. Possible Future Improvements -### Tolerate lagging non-pipelined transactions +### 6.1 Tolerate lagging non-pipelined transactions -To get closer to our ultimate goal: minimize blocking of resolved-ts, we can further consider the case where resolved-ts being blocked by normal transaction locks. Typical causes could be: +To further minimize resolved-ts blocking, consider addressing: +1. Memory locks from async commit and 1PC +2. Slow transactions which take too much time committing their locks +3. Long-running transactions (not necessarily large) +4. Node failures, network issues, etc. -- Situation-1: Memory locks from async commit and 1PC. Normal locks are region-partitioned and will not block resolved-ts of other regions. But the concurrency manager is a node-level instance. 
Memory locks can block every (leader) region in the same TiKV. -- Situation-2: Slow transactions which take too much time committing their locks -- Situation-3: Long-running transactions that may not be large. -- Situation-4: Node failures, network jitters, etc. +#### 6.1.1 Approach-1: resolver pushing min_commit_ts -#### Approach-1: resolver pushing min_commit_ts - -Resolved-ts must continuously progress. However, it can't advance autonomously while ignoring locks. Such advancement would require the commit PK operation to either complete before the resolved-ts reaches a certain point or fail. This guarantee is not feasible. - -The left approach feasible to prevent resolved-ts blocked by normal transactions are actively pushing their min_commit_ts, similar to what is done to large transactions. +Concept: Enable the resolver to push `min_commit_ts` for normal transactions, similar to large transactions. However, locks using async commit cannot be pushed. -To sum up, when a resolver meets a lock whose min_commit_ts still blocks its - -- Check the cache - - Found if T.min_commit_ts >= R_TS candidate -> skip the lock - - Else, fallthrough - -- 2PC locks, check_txn_status and try to push its min_commit_ts. - - Committed -> return its commit_ts - - Commit_ts > R_TS candidate -> skip the lock - - Commit_ts < R_TS candidate -> block at commit_ts - 1. - - Min commit ts pushed, or min_commit_ts > R_TS candidate -> skip the lock - - Rolled back -> skip the lock - - Else -> block at min_commit_ts - 1 -- Async commit locks -> check its status - - Committed, same as 2PC locks - - Rolled back -> skip the lock - - Else if min_commit_ts > R_TS candidate -> skip the lock - - Else -> block at min_commit_ts + 1 - -Locks belonging to the same transaction can be consolidated. +To manage overhead, cap the number of transactions that can be pushed per region based on their `min_commit_ts`. 
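The per-region cap can be sketched as selecting only the K transactions with the lowest `min_commit_ts` as push candidates (K and the tuple layout here are illustrative, not the actual TiKV representation):

```python
# Sketch of capping push work per region: only the K transactions with
# the lowest min_commit_ts are candidates for pushing. K would be a
# tuned constant; the (start_ts, min_commit_ts) tuples are illustrative.
import heapq

def push_candidates(txns, k):
    # txns: iterable of (start_ts, min_commit_ts) pairs in one region.
    return heapq.nsmallest(k, txns, key=lambda t: t[1])

txns = [(1, 500), (2, 120), (3, 300), (4, 90), (5, 700)]
print(push_candidates(txns, 2))  # [(4, 90), (2, 120)]
```

Selecting by lowest `min_commit_ts` targets exactly the transactions that are about to block the resolved-ts candidate, so the bounded work goes where it matters.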
-To mitigate uncontrollable overhead and metastability risks, we limit our check to K transactions per region with the lowest min_commit_ts values. This approach is necessary given the potentially substantial total number of transactions. +#### 6.1.2 Approach-2: long-running transactions setting min_commit_ts -#### Approach-2: long-running transactions setting min_commit_ts +Proposal: Long-running transactions (non-async commit, non-1PC) must obtain latest TSO as min_commit_ts before prewriting. -If a transaction already runs for a long time, it must get a latest TSO as its min_commit_ts before starts prewriting, if it's not using async commit or 1PC. This prevents the short-lived locks blocking resolved-ts, whether they are memory locks or not. +Benefit: Prevents short-lived locks from blocking resolved-ts, including memory locks. \ No newline at end of file From 7ea4f9d511123509a2bad1d716f5b3294865e65b Mon Sep 17 00:00:00 2001 From: ekexium Date: Tue, 10 Sep 2024 09:49:42 +0800 Subject: [PATCH 5/8] minor changes to clarify Signed-off-by: ekexium --- text/0114-resolved-ts-for-large-transactions.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/text/0114-resolved-ts-for-large-transactions.md b/text/0114-resolved-ts-for-large-transactions.md index 2c4496ef..e819b858 100644 --- a/text/0114-resolved-ts-for-large-transactions.md +++ b/text/0114-resolved-ts-for-large-transactions.md @@ -44,10 +44,10 @@ Key objective: Maximize all TiKV nodes' awareness of large pipelined transaction #### 3.2.1 Coordinator **Proposed Change:** -- TTL manager fetches latest TSO as min_commit_ts candidate. +- The TTL manager goroutine fetches latest TSO as min_commit_ts candidate. - Updates committer's inner state and PK in TiKV. -- Broadcasts `start_ts` and new `min_commit_ts` to all TiKV stores. - PK update piggybacked on `heartbeat` request. +- Broadcasts `start_ts` and new `min_commit_ts` to all TiKV stores. 
- Optional: Batch broadcast messages to reduce RPC overhead. Atomic variables or locks may be needed for synchronization between TTL manager and committer. @@ -134,7 +134,7 @@ Refactoring work needed, as documented in [Large Transactions Don't Block Waterm ## 5. Cost Analysis -- Memory: ~33 bytes per cache entry. TiKV instances can hold millions of entries. +- Memory: the minimum memory for each entry is 8(start_ts) + 8(min_commit_ts) + 1(status) + 8(TTL) = 25 bytes. TiKV instances can hold millions of entries. - Latency: Minimal impact due to asynchronous execution of additional operations. - RPCs: Increased RPC count, could be mitigated by batching. - CPU: Slight increase, expected to be negligible. From ecf46cfd8f94411ded382febc9e644bcd48d459a Mon Sep 17 00:00:00 2001 From: ekexium Date: Tue, 10 Sep 2024 12:59:08 +0800 Subject: [PATCH 6/8] State kvproto changes Signed-off-by: ekexium --- text/0114-resolved-ts-for-large-transactions.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/text/0114-resolved-ts-for-large-transactions.md b/text/0114-resolved-ts-for-large-transactions.md index e819b858..0f76f701 100644 --- a/text/0114-resolved-ts-for-large-transactions.md +++ b/text/0114-resolved-ts-for-large-transactions.md @@ -46,8 +46,8 @@ Key objective: Maximize all TiKV nodes' awareness of large pipelined transaction **Proposed Change:** - The TTL manager goroutine fetches latest TSO as min_commit_ts candidate. - Updates committer's inner state and PK in TiKV. -- PK update piggybacked on `heartbeat` request. +- `min_commit_ts` updates are piggybacked on `heartbeat` request. -- Broadcasts `start_ts` and new `min_commit_ts` to all TiKV stores. +- Broadcasts `start_ts` and new `min_commit_ts` to all TiKV stores. A new type of message will be introduced in kvproto. - Optional: Batch broadcast messages to reduce RPC overhead. Atomic variables or locks may be needed for synchronization between TTL manager and committer. 
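The memory estimate in the cost analysis is easy to verify with back-of-the-envelope arithmetic (payload only; a real hash-map entry adds bucket and allocator overhead on top of these field sizes):

```python
# Back-of-the-envelope check of the cache-entry memory estimate, using
# the per-field sizes listed in the cost analysis.
START_TS, MIN_COMMIT_TS, STATUS, TTL = 8, 8, 1, 8  # bytes per field
entry_bytes = START_TS + MIN_COMMIT_TS + STATUS + TTL
print(entry_bytes)  # 25

# With the bound of 10000 concurrent pipelined transactions assumed
# earlier in the RFC:
print(entry_bytes * 10_000)  # 250000 bytes, well under 1 MiB per TiKV
```

Even with generous per-entry overhead the cache stays tiny relative to a TiKV node's memory, which is why the design never evicts for space reasons.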
From 8912b42d014d28bfb63619502f04b0e07776fb3b Mon Sep 17 00:00:00 2001 From: ekexium Date: Thu, 12 Sep 2024 18:45:37 +0800 Subject: [PATCH 7/8] describe the representative key in resolver Signed-off-by: ekexium --- text/0114-resolved-ts-for-large-transactions.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/text/0114-resolved-ts-for-large-transactions.md b/text/0114-resolved-ts-for-large-transactions.md index 0f76f701..c46aa627 100644 --- a/text/0114-resolved-ts-for-large-transactions.md +++ b/text/0114-resolved-ts-for-large-transactions.md @@ -48,6 +48,7 @@ Key objective: Maximize all TiKV nodes' awareness of large pipelined transaction - Updates committer's inner state and PK in TiKV. - `min_commit_ts` updates are piggybacked on `heartbeat` request. - Broadcasts `start_ts` and new `min_commit_ts` to all TiKV stores. A new type of message will be introduced in kvproto. +- Broadcasts transaction status when (1) all locks are resolved, (2) the transaction is rolled back, or (3) the transaction rollback finishes (all locks resolved) - Optional: Batch broadcast messages to reduce RPC overhead. Atomic variables or locks may be needed for synchronization between TTL manager and committer. @@ -91,8 +92,9 @@ Given the condition that given the condition that the number of concurrent pipel 1. Standard lock handling remains unchanged. 2. For large pipelined transaction locks: - Identify by "generation" field. - - Track only `start_ts` of locks. + - Track only `start_ts` and one representative key of a pipelined transaction. 3. Stop tracking `start_ts` for large pipelined transactions upon LOCK DELETION event. + - When `start_ts` is unknown, use the representative key to find the transaction. 
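The representative-key bookkeeping can be sketched as a pair of maps (hypothetical structures, not the actual resolver code):

```python
# Sketch of tracking one representative key per large pipelined txn so
# the resolver can stop tracking on a LOCK DELETION event even when the
# event does not carry the start_ts (hypothetical structures).
class LargeTxnTracker:
    def __init__(self):
        self.key_by_start_ts = {}  # start_ts -> representative key
        self.start_ts_by_key = {}  # representative key -> start_ts

    def track(self, start_ts, representative_key):
        self.key_by_start_ts[start_ts] = representative_key
        self.start_ts_by_key[representative_key] = start_ts

    def on_lock_deletion(self, key, start_ts=None):
        # If start_ts is unknown, recover it via the representative key.
        if start_ts is None:
            start_ts = self.start_ts_by_key.get(key)
        if start_ts in self.key_by_start_ts:
            rep = self.key_by_start_ts.pop(start_ts)
            self.start_ts_by_key.pop(rep, None)
        return start_ts

tracker = LargeTxnTracker()
tracker.track(100, b"k1")
print(tracker.on_lock_deletion(b"k1"))  # 100; txn 100 is no longer tracked
```

One key per transaction keeps the tracker's footprint proportional to the number of large transactions rather than the number of their locks.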
**Proposed Resolved-ts calculation:** - Primary Method: From 62b9ec80ad672a52606a57567db8c2268447d8e3 Mon Sep 17 00:00:00 2001 From: ekexium Date: Wed, 9 Oct 2024 14:28:31 +0800 Subject: [PATCH 8/8] fix typo Signed-off-by: ekexium --- text/0114-resolved-ts-for-large-transactions.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/0114-resolved-ts-for-large-transactions.md b/text/0114-resolved-ts-for-large-transactions.md index c46aa627..700101d1 100644 --- a/text/0114-resolved-ts-for-large-transactions.md +++ b/text/0114-resolved-ts-for-large-transactions.md @@ -73,7 +73,7 @@ An alternative solution is to let PD push the full store list to TiDBs when the 1. Provides `min_commit_ts` information for resolved-ts resolver. 2. Offers optimized path for read requests, reducing PK queries for transaction status. -Given the condition that given the condition that the number of concurrent pipelined transactions is limited, we want the cache to satisfy the following requirements: +Given the condition that the number of concurrent pipelined transactions is limited, we want the cache to satisfy the following requirements: - Maximize retention of useful information. - Entries should be discarded based on coordinator directives, not due to TTL expiration or cache capacity limits. - Maximize hit rate, i.e. the eviction of a transaction entry should happen after lock release, especially for follower peers.