Conversation

@CalvinNeo

What is changed and how it works?

Issue Number: Close #xxx

What's Changed:


Related changes

  • PR to update pingcap/docs/pingcap/docs-cn:
  • Need to cherry-pick to the release branch

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Release note


lcwangchao and others added 30 commits May 9, 2025 08:34
close tikv#18441

When the secondary commit fails with the error `CommitTsExpired`, collect the MVCC info for debugging.

Signed-off-by: Chao Wang <[email protected]>
…ikv#18448)

close tikv#18441

In the previous PR, we only collected MVCC info when `commit_role` is `Secondary` and `commit_ts < min_commit_ts`. However, when resolving a lock, the `commit_role` of `commit` is None and we cannot get any MVCC info when this error happens. This PR enhances that by checking whether the resolved key is the primary key in the lock; if not, it still collects MVCC info for further debugging.

Signed-off-by: Chao Wang <[email protected]>
Signed-off-by: 王超 <[email protected]>

Co-authored-by: cfzjywxk <[email protected]>
ref tikv#17465

Following tikv#17605, another attempt to update the Rust toolchain.

Changes:

- Language
  - After rust-lang/rust#134258, we can't manually impl both `ToString` and `fmt::Display`, so this PR adds a new trait `ToStringValue` to work around types whose `ToString` and `Display` implementations produce different results.
- Clippy (these rewrites are sketched in the example after this list)
  - `Option::map_or(false, ...)` --> `Option::is_some_and(...)`
  - `Option::map_or(true, ...)` --> `Option::is_none_or(...)`
  - `(a + b - 1) / b` --> `a.div_ceil(b)`
  - `io::Error::new(ErrorKind::Other, ...)` --> `io::Error::other(...)`
  - `Slice::group_by` --> `Slice::chunk_by`
  - `Result::map_err(|e| {...; e})` --> `Result::inspect_err(|e| { ... })`
  - `Map::get(&key).is_{some, none}()` --> `Map::contains_key()`
- Formatter
  - The import order now follows ASCII order, e.g. `use crate::{a, b, c, A, B, C}` becomes `use crate::{A, B, C, a, b, c}`. Most of the changes in this PR are due to this.
  - Lists in rustdoc comments must be properly aligned.
- cargo-deny
  - `vulnerability`, `notice` and `unsound` can't be configured in version 2, and `unmaintained` can't be allowed anymore (but it supports setting `workspace` to allow indirect packages), so some unmaintained packages are replaced with the suggested alternatives. (See: tikv#18416)
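
For reference, here is a minimal, self-contained sketch of the Clippy-driven rewrites listed above, using only standard-library APIs; the variable names and values are illustrative and not taken from TiKV.

```rust
use std::{collections::HashMap, io};

fn main() {
    let quota: Option<u64> = Some(8);
    let limit: Option<u64> = None;

    // Option::map_or(false, ...) -> Option::is_some_and(...)
    let over = quota.is_some_and(|q| q > 4);
    // Option::map_or(true, ...) -> Option::is_none_or(...)
    let unlimited = limit.is_none_or(|l| l == 0);

    // (a + b - 1) / b -> a.div_ceil(b)
    let parts = 10u64.div_ceil(3); // 4

    // io::Error::new(ErrorKind::Other, ...) -> io::Error::other(...)
    let err = io::Error::other("background worker stopped");

    // Result::map_err(|e| {...; e}) -> Result::inspect_err(|e| { ... })
    let _ = Err::<(), _>(err).inspect_err(|e| eprintln!("warn: {e}"));

    // Map::get(&key).is_some() -> Map::contains_key(&key)
    let mut caches: HashMap<&str, u64> = HashMap::new();
    caches.insert("kv", 1);
    let hit = caches.contains_key("kv");

    // Slice::group_by -> Slice::chunk_by
    let data = [1u64, 1, 2, 3, 3];
    let runs: Vec<&[u64]> = data.chunk_by(|a, b| a == b).collect();

    println!("{over} {unlimited} {parts} {hit} {}", runs.len());
}
```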

Signed-off-by: glorv <[email protected]>
…v#18454)

ref tikv#18434

Close some background schedulers before shutting down.

We need to restart both the KV engine and the Raft engine in a test to set some non-online configs,
e.g. turning off Titan.
These background workers hold references to either the KV engine or the Raft engine, and they are also
self-referencing, so the KV engine and the Raft engine never get closed on shutdown. We need this change
to interrupt the infinite loop and release the references to the DBs.

The cleanest way to resolve this would be to use Tokio's unbounded channel, which allows downgrading to a
weak pointer. However, TiKV has too many circular `Arc` dependencies, and it is nearly impossible to untangle them.
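
As a rough illustration of the shutdown pattern described above (not TiKV code), the sketch below shows a background worker whose `Arc` to the engine keeps the engine alive until the worker is stopped explicitly; the `Engine`/`Worker` names and the stop flag are illustrative assumptions.

```rust
use std::{
    sync::{
        atomic::{AtomicBool, Ordering},
        Arc,
    },
    thread,
    time::Duration,
};

struct Engine; // stands in for the KV or Raft engine

struct Worker {
    engine: Arc<Engine>,      // keeps the engine alive
    stopped: Arc<AtomicBool>, // explicit stop flag breaks the loop
}

impl Worker {
    fn spawn(engine: Arc<Engine>) -> (Arc<AtomicBool>, thread::JoinHandle<()>) {
        let stopped = Arc::new(AtomicBool::new(false));
        let worker = Worker {
            engine,
            stopped: stopped.clone(),
        };
        let handle = thread::spawn(move || {
            // Without the stop flag, this loop (and its Arc<Engine>) would live forever.
            while !worker.stopped.load(Ordering::Relaxed) {
                let _ = &worker.engine; // pretend to do background work
                thread::sleep(Duration::from_millis(10));
            }
        });
        (stopped, handle)
    }
}

fn main() {
    let engine = Arc::new(Engine);
    let (stop, handle) = Worker::spawn(engine.clone());

    // Shutdown: stop the background scheduler first, then drop the engine,
    // so the engine's last Arc reference can actually be released.
    stop.store(true, Ordering::Relaxed);
    handle.join().unwrap();
    drop(engine);
}
```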

Signed-off-by: Yang Zhang <[email protected]>
…18439)

close tikv#18463

Enhance the detection mechanism to cover I/O jitters on the kvdb disk when it is deployed with separate mount paths.

Signed-off-by: lucasliang <[email protected]>
close tikv#18474

bump the version of pprof-rs to 0.15

Signed-off-by: Yang Keao <[email protected]>
close tikv#18465

Downgrade the Rust toolchain to fix the arm64 build (a workaround for Rust bug rust-lang/rust#141306).

Signed-off-by: glorv <[email protected]>
ref tikv#15990

Fix the issue where the `yatp_task_wait_duration` metric has no data point because it was not registered to Prometheus.

Signed-off-by: Bisheng Huang <[email protected]>
…ikv#18484)

close tikv#18490

`cdc register for a not found region`: this indicates that the region leader might have been transferred to another node.

`cdc failed to schedule barrier for delta before delta scan`: this always happens if the channel is disconnected.

`cdc send scan event failed`: this always happens if the channel is disconnected or full.

All these errors are temporary, so set their log level to WARN.

Signed-off-by: 3AceShowHand <[email protected]>
close tikv#18434

The real bug has been fixed by distinguishing the Titan blob index from the RocksDB blob index during the RocksDB upgrade effort.
This PR just adds a regression test to prove it is fixed.
We reproduced the error with the same test code on 7.5.

Signed-off-by: Yang Zhang <[email protected]>
close tikv#18497, ref pingcap/tidb#61318

backup_stream: encode the ts-related field into the meta file path.

Signed-off-by: 3pointer <[email protected]>
Signed-off-by: 3pointer <[email protected]>

Co-authored-by: 山岚 <[email protected]>
close tikv#18493

Some non-fatal error-level logs during backup are now warn-level.

Signed-off-by: Juncen Yu <[email protected]>
close tikv#18434

Fix the bug where Titan blob indices cause snapshot apply failures after Titan is turned off.

Signed-off-by: Yang Zhang <[email protected]>
…nfig file. (tikv#18505)

close tikv#18503

Make TiKV inherit the last `region-size` configuration to avoid changing the default region size unexpectedly.

Signed-off-by: lucasliang <[email protected]>
close tikv#18541

Run the RPC function `switch_mode` on a thread where blocking is acceptable.

Signed-off-by: Jianjun Liao <[email protected]>
…slowlog. (tikv#18562)

close tikv#18561

Fix the incorrect and misleading index logging for `StoreMsg` in the slow log.

Signed-off-by: lucasliang <[email protected]>
close tikv#18533

To mitigate the impact of stalls when awakening too many regions, we break up all regions into small batches.
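
A minimal sketch of the batching idea, assuming a hypothetical `awaken_batch` helper and an illustrative batch size; the real raftstore logic differs in detail.

```rust
// A minimal sketch (not TiKV code) of waking regions in small batches
// instead of all at once; `awaken_batch` and BATCH_SIZE are illustrative.
fn awaken_batch(batch: &[u64]) {
    // In the real raftstore this would send wake-up messages to each region peer.
    println!("awaken {} regions: {:?}", batch.len(), batch);
}

fn main() {
    let hibernated_regions: Vec<u64> = (1..=10).collect();
    const BATCH_SIZE: usize = 4; // illustrative; keeps each round's work bounded

    for batch in hibernated_regions.chunks(BATCH_SIZE) {
        awaken_batch(batch);
        // The real implementation spreads batches over time to avoid stalls.
    }
}
```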

Signed-off-by: lucasliang <[email protected]>
…r. (tikv#18565)

close tikv#18532

Optimize the handling of `CompactedEvent` in raftstore by moving it to the `split-check` worker.

Signed-off-by: lucasliang <[email protected]>
close tikv#18573

Fix issue tikv#18573: an error occurred while running `make doc`.
Fix it by modifying the Makefile; see the solution in the issue.

Signed-off-by: DogDu <[email protected]>

Co-authored-by: DogDu <[email protected]>
…itters. (tikv#18590)

close tikv#18549

Removes the logging for "sst ingest is too slow" to avoid latency jitters.

Signed-off-by: lucasliang <[email protected]>
close tikv#18506

Replace some error logs with warning logs.

Signed-off-by: bufferflies <[email protected]>
close tikv#10047

Register missing TiKV configs to Prometheus.
Due to recent iterations and refactoring in TiKV, some important module configurations were not reported via metrics.
This PR registers the configuration metrics for these key modules.

Signed-off-by: exit-code-1 <[email protected]>
Signed-off-by: zhy <[email protected]>

Co-authored-by: lucasliang <[email protected]>
ref tikv#15990

build: bump tikv pkg version

Signed-off-by: ti-chi-bot <[email protected]>
…he last config file (tikv#18626)

ref tikv#18503

Avoid inheriting unexpected configurations from the last config file.

Signed-off-by: lucasliang <[email protected]>
close tikv#18605

Optimizes `fetch_entries_to` in Raft-Engine to reduce contention and improve performance under mixed workloads.

Signed-off-by: lucasliang <[email protected]>
Connor1996 and others added 24 commits September 17, 2025 16:48
tikv#18967)

close tikv#18743

Optimize async snapshot and write tail latency with many SSTs

Signed-off-by: Connor1996 <[email protected]>
close tikv#18955

Reduce frequency of store size reporting

Signed-off-by: Yang Zhang <[email protected]>
ref tikv#18498

Add more duplicate entry checks in the write path and panic if there are unexpected results.
Note that the panic should be removed in a new release or in production.

Signed-off-by: cfzjywxk <[email protected]>
Signed-off-by: cfzjywxk <[email protected]>
…#18940)

close tikv#18939

disable the buggy auto priority quota limiter

Signed-off-by: glorv <[email protected]>
…usly (tikv#18984)

close tikv#18983

Address the bug that the `raftstore` thread panics when accessing the raft logs of an asynchronously destroyed peer in `on_raft_log_gc_tick()`.

Signed-off-by: lucasliang <[email protected]>
ref tikv#18498

Disable ENABLE_DUP_KEY_DEBUG by default

Signed-off-by: ekexium <[email protected]>
…l thresholds. (tikv#18710)

close tikv#18708

This PR addresses performance stability issues caused by increasing
storage.flow-control.l0-file-threshold and
storage.flow-control.soft-pending-compaction-bytes-limit. Previously,
raising these values could reduce the effectiveness of RocksDB's compaction
speed-up mechanism, because the RocksDB internal thresholds
(level0-slowdown-writes-trigger and soft-pending-compaction-bytes-limit)
would be overridden, delaying compaction acceleration.

Key improvements:
1. Conditional override of RocksDB thresholds (sketched in the example at the end of this description):
  - level0-slowdown-writes-trigger is overridden by l0-file-threshold only if the latter is smaller.
  - soft-pending-compaction-bytes-limit is overridden only if storage.flow-control.soft-pending-compaction-bytes-limit is smaller.
This ensures that increasing flow-control settings does not weaken compaction acceleration, while user-configured RocksDB thresholds that are larger than the flow-control limits are overridden, allowing compaction speed-up to trigger before write flow control.
2. Updated write stall check:
  - ingest_maybe_slowdown_writes now uses level0-stop-writes-trigger instead of level0-slowdown-writes-trigger to determine whether ingest may trigger a write stall.
  - This keeps the original behavior, since `l0-file-threshold` overrides `level0-stop-writes-trigger`, just like the previous behavior with `level0-slowdown-writes-trigger`. Ideally, flow-control settings would be used directly to determine write stalls, but `ingest_maybe_slowdown_writes` cannot access the flow-control module configuration because this function resides inside the Engine module.

After this change, write control effectively has three stages:
1. Compaction acceleration: triggered when RocksDB thresholds are reached.
2. Flow control: triggered at storage.flow-control.l0-file-threshold and storage.flow-control.soft-pending-compaction-bytes-limit.
3. Stop writes: triggered at storage.flow-control.hard-pending-compaction-bytes-limit.
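
Reading the description above as "keep the smaller of the RocksDB and flow-control values", a minimal sketch of the conditional override might look like the following; the struct and field names are illustrative, not the actual TiKV config types.

```rust
// A minimal sketch (not the actual TiKV config code) of the conditional
// override described above: RocksDB's own triggers are only overridden by the
// flow-control thresholds when the flow-control values are smaller, so that
// raising flow-control limits cannot delay compaction speed-up.
#[derive(Debug)]
struct RocksCfConfig {
    level0_slowdown_writes_trigger: u64,
    soft_pending_compaction_bytes_limit: u64,
}

struct FlowControlConfig {
    l0_file_threshold: u64,
    soft_pending_compaction_bytes_limit: u64,
}

fn apply_flow_control_overrides(rocks: &mut RocksCfConfig, flow: &FlowControlConfig) {
    // Override only when the flow-control threshold is smaller than what RocksDB
    // would use on its own; otherwise keep the smaller RocksDB value so
    // compaction acceleration still kicks in before write flow control.
    if flow.l0_file_threshold < rocks.level0_slowdown_writes_trigger {
        rocks.level0_slowdown_writes_trigger = flow.l0_file_threshold;
    }
    if flow.soft_pending_compaction_bytes_limit < rocks.soft_pending_compaction_bytes_limit {
        rocks.soft_pending_compaction_bytes_limit = flow.soft_pending_compaction_bytes_limit;
    }
}

fn main() {
    let mut rocks = RocksCfConfig {
        level0_slowdown_writes_trigger: 20,
        soft_pending_compaction_bytes_limit: 192 << 30, // 192 GiB (illustrative)
    };
    let flow = FlowControlConfig {
        l0_file_threshold: 60,                        // raised by the user
        soft_pending_compaction_bytes_limit: 1 << 40, // 1 TiB, raised by the user
    };
    apply_flow_control_overrides(&mut rocks, &flow);
    // RocksDB keeps its smaller triggers, so compaction speed-up is not delayed.
    println!("{rocks:?}");
}
```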

Signed-off-by: hhwyt <[email protected]>
close tikv#18999

fix external storage cache block

Signed-off-by: Jianjun Liao <[email protected]>
…ikv#18923)

close tikv#18815

Add network/io info collection for TopSQL:
1. Introduce resource-metering.enable-network-io-collection config to control whether enable this new feature. Default is disabled.
2. Collect network_in, network_out, logical_read, logical_write execution info and recorded in TopSQL:
    i. Since the LocalStorage of TopSQL recorder is TLS, and only be accessed inside the thread with attached tag. Here, we use GLOBAL_TRACKERS to help record network_in data size.
    ii. For network_out, we can only directly get resp's size for Coprocessor request. Thus we need to collect this data one by one for all requests. Since we only care about requests that potentially generate large response, we bypass some "write requests" whose response only contained "commit_ts" data.
    iii. Use the processed_size(https://github.com/tikv/tikv/blob/1deb3a135dc41c3ca227e3d5a29712526b492a4c/components/tikv_kv/src/stats.rs#L195) as logical read size
    iv. Use the scheduled tasks' write_bytes() as logical write size
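
A hypothetical sketch of the idea in point (i), not the actual TopSQL recorder API: per-thread recording only works on threads that attached a request tag, so bytes observed elsewhere are funnelled through a global tracker and merged later. All names here are illustrative.

```rust
use std::{
    cell::RefCell,
    collections::HashMap,
    sync::{LazyLock, Mutex},
};

thread_local! {
    // Set only on worker threads that attached a TopSQL tag for the request.
    static ATTACHED_TAG: RefCell<Option<String>> = const { RefCell::new(None) };
}

// Global fallback tracker: request tag -> network_in bytes (illustrative).
static GLOBAL_TRACKERS: LazyLock<Mutex<HashMap<String, u64>>> =
    LazyLock::new(|| Mutex::new(HashMap::new()));

fn record_network_in(tag: &str, bytes: u64) {
    let recorded_locally = ATTACHED_TAG.with(|t| {
        t.borrow().as_deref() == Some(tag) // only valid inside the tagged thread
    });
    if !recorded_locally {
        // Off-thread (e.g. I/O threads): fall back to the global tracker.
        *GLOBAL_TRACKERS
            .lock()
            .unwrap()
            .entry(tag.to_string())
            .or_insert(0) += bytes;
    } else {
        // In the real recorder this would update the thread-local storage instead.
    }
}

fn main() {
    record_network_in("select * from t", 4096);
    println!("{:?}", GLOBAL_TRACKERS.lock().unwrap());
}
```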

Signed-off-by: yibin87 <[email protected]>
close tikv#17221

When a SIGTERM signal is received, TiKV tells PD that it is stopping via StoreHeartbeat. PD then tries to move all the leaders off that TiKV instance before it fully shuts down.

Signed-off-by: hujiatao0 <[email protected]>
ref tikv#15990

Propagates errors in get_next_region_context up the call stack

Signed-off-by: Yang Zhang <[email protected]>
…kUp (tikv#19008)

close tikv#19007

Support basic summary and metrics for push down IndexLookUp

Signed-off-by: Chao Wang <[email protected]>
close tikv#18949

Check the memory locks in `ExtraSnapStoreAccessor::get_local_region_storage` to make sure `IndexLookUp` can get consistent rows for 1PC or async commit.

Signed-off-by: Chao Wang <[email protected]>
…nder is delayed. (tikv#19015)

close tikv#19004

Address the corner case where the `raftstore` thread panics when handling `ReadyToDestroyPeer`.

Signed-off-by: lucasliang <[email protected]>
tikv#19025)

close tikv#18498

Signed-off-by: Tharanga Gamaethige <[email protected]>

Co-authored-by: Tharanga Gamaethige <[email protected]>
…troying. (tikv#19030)

ref tikv#19004, close tikv#19034

Fix the bug introduced by the previous PR tikv#19015, which made a peer that is being destroyed unable to handle `ApplyRes::(...)` as expected.

Signed-off-by: lucasliang <[email protected]>
close tikv#19048

Fix a potential panic that may happen when subscribing to a region and encountering rollback and prewrite entries.

Signed-off-by: 3AceShowHand <[email protected]>
close tikv#19006

Update the Azure SDK to 0.18, the highest version compatible with TiKV's Rust version.
Adapt to the new interfaces and to Azure managed identity.

Signed-off-by: RidRisR <[email protected]>
…ords also (tikv#19029)

close tikv#18814

When "enable_network_io_collection" is set, 
1. Picks top n records for network and top n records for logical io. One record will be picked at most once.
2. Add new aggregator for region_id, pick top n records for cpu, network, logical io, and report final results.
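
A hypothetical sketch of the top-n picking in point 1 (not the TiKV aggregator code): take the top-n records by network bytes and the top-n by logical I/O, making sure a record that ranks high in both is only picked once. The `Record` type and field names are illustrative.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone)]
struct Record {
    id: u64, // e.g. a SQL digest or region id; illustrative
    network_bytes: u64,
    logical_bytes: u64,
}

fn pick_top_n(records: &[Record], n: usize) -> Vec<Record> {
    let mut picked_ids: HashSet<u64> = HashSet::new();
    let mut picked = Vec::new();

    // Take up to n records ranked by `key`, skipping anything already picked.
    let take = |key: fn(&Record) -> u64,
                picked_ids: &mut HashSet<u64>,
                picked: &mut Vec<Record>| {
        let mut sorted = records.to_vec();
        sorted.sort_by_key(|r| std::cmp::Reverse(key(r)));
        for r in sorted.into_iter().take(n) {
            if picked_ids.insert(r.id) {
                picked.push(r);
            }
        }
    };

    take(|r| r.network_bytes, &mut picked_ids, &mut picked);
    take(|r| r.logical_bytes, &mut picked_ids, &mut picked);
    picked
}

fn main() {
    let records = vec![
        Record { id: 1, network_bytes: 900, logical_bytes: 10 },
        Record { id: 2, network_bytes: 500, logical_bytes: 800 },
        Record { id: 3, network_bytes: 100, logical_bytes: 700 },
    ];
    // With n = 2: ids 1 and 2 by network, then id 3 by logical I/O (id 2 is already picked).
    for r in pick_top_n(&records, 2) {
        println!("{r:?}");
    }
}
```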

Signed-off-by: yibin87 <[email protected]>
close tikv#18604

In tests we noticed that if an SST download fails halfway for some reason, the files are not deleted and thus keep occupying space. We should clean them up (see the cleanup sketch below).
Also fix broken BR metrics and add one more for download failures.
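
A minimal sketch of the cleanup idea, assuming a hypothetical drop-guard around the downloaded file; the real import code differs, and all names here are illustrative.

```rust
// A minimal sketch (not the actual import/download code): if a download fails
// partway, the partially written file is removed instead of being left around.
use std::{fs, io, path::PathBuf};

/// Removes the partially downloaded file unless the download is marked complete.
struct PartialFileGuard {
    path: PathBuf,
    completed: bool,
}

impl Drop for PartialFileGuard {
    fn drop(&mut self) {
        if !self.completed {
            let _ = fs::remove_file(&self.path); // best-effort cleanup
        }
    }
}

fn download_sst(path: PathBuf) -> io::Result<()> {
    let mut guard = PartialFileGuard { path: path.clone(), completed: false };
    fs::write(&path, b"partial contents")?; // pretend this is the streamed download
    // Simulate a mid-download failure; the guard will delete the partial file.
    let failed = true;
    if failed {
        return Err(io::Error::other("download interrupted"));
    }
    guard.completed = true;
    Ok(())
}

fn main() {
    let path = PathBuf::from("example.sst.tmp");
    if let Err(e) = download_sst(path.clone()) {
        eprintln!("download failed: {e}; partial file removed: {}", !path.exists());
    }
}
```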

Signed-off-by: Wenqi Mou <[email protected]>
close tikv#18843, close tikv#18950

1. Remove read_buf_exact_size for s3 hyper client
2. Use cloud::blob::read_to_end to read migrations from futures::io::AsyncRead
3. Use bytes::Bytes to speed up deallocating MetaFile

Signed-off-by: Jianjun Liao <[email protected]>
Signed-off-by: Jianjun Liao <[email protected]>
@ti-chi-bot

ti-chi-bot bot commented Oct 29, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign calvinneo for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the size/XXL label Oct 29, 2025
@pingcap-cla-assistant

pingcap-cla-assistant bot commented Oct 29, 2025

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
5 out of 10 committers have signed the CLA.

✅ yibin87
✅ CalvinNeo
✅ 3AceShowHand
✅ Leavrth
✅ lcwangchao
❌ Tristan1900
❌ RidRisR
❌ squalfof
❌ LykxSassinator
❌ tharanga
You have signed the CLA already but the status is still pending? Let us recheck it.

@CalvinNeo force-pushed the merge-tikv-2025-10-30 branch from 9fccb1e to 5aa0a4d on October 29, 2025 10:06
@ti-chi-bot

ti-chi-bot bot commented Oct 29, 2025

@CalvinNeo: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-unit-test
Commit: 5aa0a4d
Details: link
Required: true
Rerun command: /test pull-unit-test

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

