This document proposes region-level isolation in TiKV, preventing hot regions from monopolizing resources within a tenant. The design extends the existing resource-group-based priority system with region-level virtual time (VT) tracking.
TiKV currently implements resource control at the resource group level:
- Resource groups represent tenants and track RU (Resource Unit) consumption
- Each resource group has a `group_priority` (LOW, MEDIUM, HIGH)
- The `ResourceController` uses the mClock algorithm with a virtual time (VT) per resource group
- Tasks are ordered by `concat_priority_vt(group_priority, resource_group_vt)`
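For intuition, the current single-value scheme can be pictured as packing both factors into one `u64`. This is only an illustrative sketch; the bit widths and sort direction here are assumptions, not TiKV's exact encoding:

```rust
/// Illustrative only: pack tenant priority into the high bits so it
/// dominates, with the virtual time breaking ties within one level.
/// Assumes a min-ordered queue (smaller value = scheduled earlier),
/// hence the inverted priority byte.
fn concat_priority_vt(group_priority: u8, group_vt: u64) -> u64 {
    (((u8::MAX - group_priority) as u64) << 56) | (group_vt & ((1u64 << 56) - 1))
}
```

Squeezing a third, region-level factor into the same word would force aggressive truncation, which motivates the struct-based priority below.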
No region-level fairness: A hot region (hot keys or large scans) can monopolize resources within a tenant, starving other regions belonging to the same tenant.
Hot regions should be deprioritized to prevent resource monopolization within a tenant.
Introduce region-level virtual time (VT) alongside existing resource group VT. Each request's priority is determined by three factors in hierarchical order:
- group_priority: Tenant priority (HIGH/MEDIUM/LOW) - tenant isolation
- group_vt: Resource group virtual time - tenant fairness
- region_vt: Region virtual time - region fairness within tenant
Replace the current packed u64 priority with a struct:
```rust
struct TaskPriority {
    group_priority: u8, // 1-16 (HIGH/MEDIUM/LOW from PD)
    group_vt: u64,      // Resource group virtual time
    region_vt: u64,     // Region virtual time
}
```

Comparison order (most significant first):
- `group_priority`: higher value = higher priority (tenant isolation)
- `group_vt`: lower value = higher priority (tenant fairness)
- `region_vt`: lower value = higher priority (region fairness)
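A minimal sketch of that ordering as a Rust `Ord` implementation (the struct is restated so the example is self-contained; whether the queue pops the greatest or least element is an implementation detail, here greater means scheduled first):

```rust
use std::cmp::Ordering;

#[derive(PartialEq, Eq)]
struct TaskPriority {
    group_priority: u8,
    group_vt: u64,
    region_vt: u64,
}

impl Ord for TaskPriority {
    fn cmp(&self, other: &Self) -> Ordering {
        // Higher group_priority wins; ties fall through to the VTs,
        // where the *lower* virtual time wins, so the comparisons on
        // the VT fields are reversed.
        self.group_priority
            .cmp(&other.group_priority)
            .then_with(|| other.group_vt.cmp(&self.group_vt))
            .then_with(|| other.region_vt.cmp(&self.region_vt))
    }
}

impl PartialOrd for TaskPriority {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
```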
On task scheduling:
- Group VT increases by the group's `vt_delta_for_get` (fixed per group)
- Region VT increases by the region's `vt_delta_for_get` (varies with region hotness)
- Hot regions accumulate VT faster → pushed back in the queue
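A toy calculation of the effect, with assumed numbers (base delta of 100, hot region at 4x the average RU rate):

```rust
fn main() {
    let base_delta = 100u64;
    let (mut cold_vt, mut hot_vt) = (0u64, 0u64);
    for _ in 0..10 {
        cold_vt += base_delta;    // ratio = region_ru / avg_ru = 1.0
        hot_vt += base_delta * 4; // ratio = 4.0
    }
    // After ten requests each, the hot region's VT (4000) is far ahead
    // of the cold region's (1000), so its next task sorts behind.
    assert!(hot_vt > cold_vt);
}
```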
On task completion:
- Group VT increases by actual CPU time consumed
- Region VT increases by actual CPU time consumed
Periodic normalization (every ~1 second):
- Find min/max VT across all regions
- Pull lagging regions toward leader (prevent starvation)
- Reset all VTs if near overflow
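A minimal sketch of that normalization pass, assuming a plain map of region VTs; `MAX_VT_LAG` and `VT_RESET_THRESHOLD` are illustrative constants, not values from this proposal:

```rust
use std::collections::HashMap;

const MAX_VT_LAG: u64 = 1_000_000;            // assumed tuning constant
const VT_RESET_THRESHOLD: u64 = u64::MAX / 2; // assumed overflow margin

fn normalize_region_vts(region_vts: &mut HashMap<u64, u64>) {
    // Find the min/max VT across all regions.
    let Some(&max_vt) = region_vts.values().max() else {
        return; // no regions tracked
    };
    // Pull lagging regions toward the leader: cap how far behind a
    // long-idle region may sit, so it cannot monopolize the queue once
    // it becomes active again.
    let floor = max_vt.saturating_sub(MAX_VT_LAG);
    for vt in region_vts.values_mut() {
        *vt = (*vt).max(floor);
    }
    // Reset all VTs if near overflow, preserving their relative gaps.
    if max_vt >= VT_RESET_THRESHOLD {
        let min_vt = *region_vts.values().min().unwrap();
        for vt in region_vts.values_mut() {
            *vt -= min_vt;
        }
    }
}
```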
Hot regions accumulate high VT and get deprioritized. Because load-based split decisions are driven by served QPS, throttling a hot region also dampens the QPS signal that would trigger its split.
When a region splits, VT behavior depends on CPU utilization:
When CPU utilization > 80% (system overloaded):
- Split regions share a common VT inherited from parent region
- Both child regions contribute to and read from the same VT tracker
- Maintains strong traffic moderation - even after splitting, the hot key/region group remains deprioritized
- Prevents split from immediately bypassing backpressure
When CPU utilization drops < 80% (system has capacity):
- Split regions transition to independent VTs
- Each region gets its own VT tracker, initialized to common VT value
- Allows natural load balancing
Implementation:
- Track CPU utilization as rolling average (~10 seconds)
- On region split, create a `RegionGroup` if CPU > 80%, linking child regions to a shared VT
- Periodically check CPU utilization (every 1-5 seconds)
- When CPU drops < 80%, dissolve region groups and transition to independent VTs
```rust
use std::sync::Arc;
use std::sync::atomic::AtomicU64;
use std::time::Duration;

use dashmap::DashMap;

struct RegionResourceTracker {
    region_vts: DashMap<u64, RegionVtTracker>,
    cpu_utilization: AtomicU64, // Rolling average (percent)
}

struct RegionVtTracker {
    virtual_time: AtomicU64,
    vt_delta_for_get: AtomicU64,
    parent_vt: Option<Arc<AtomicU64>>, // Shared parent VT if CPU > 80% at split
}

impl RegionResourceTracker {
    fn get_and_increment_vt(&self, region_id: u64) -> u64 {
        // If parent_vt exists, use the shared parent VT;
        // otherwise use the region's independent VT.
    }

    fn on_region_split(&self, parent_id: u64, child1_id: u64, child2_id: u64) {
        // Read the parent's VT value.
        // If cpu_utilization > 80%:
        //   create an Arc<AtomicU64> holding the parent VT;
        //   both children share a reference to it via parent_vt.
        // Else:
        //   both children get independent VTs initialized to the parent VT.
    }

    fn check_and_transition_to_independent(&self) {
        // If cpu_utilization < 80%:
        //   for each region with a parent_vt:
        //     copy the parent_vt value into virtual_time;
        //     set parent_vt to None.
    }

    fn update_vt_deltas(&self) {
        // Periodically adjust vt_delta based on region hotness:
        //   ratio = region_ru / avg_ru
        //   delta = base_delta * ratio
    }

    fn normalize_region_vts(&self) {
        // Pull lagging regions forward; reset all VTs if near overflow.
    }

    fn consume(&self, region_id: u64, cpu_time: Duration, keys: u64, bytes: u64) {
        // Increment the VT based on actual consumption.
        // If parent_vt exists, increment the shared parent VT.
    }

    fn cleanup_inactive_regions(&self) {
        // Remove regions with no recent VT updates.
    }
}
```
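For concreteness, here is a hedged sketch of three of these methods under the layout above. `BASE_VT_DELTA` is an assumed tuning constant, the 80% threshold comes from this proposal, and error handling is elided; this is a sketch, not the final implementation:

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};

const BASE_VT_DELTA: u64 = 100; // assumed base cost per scheduled task

impl RegionResourceTracker {
    fn get_and_increment_vt(&self, region_id: u64) -> u64 {
        let tracker = self.region_vts.entry(region_id).or_insert_with(|| RegionVtTracker {
            virtual_time: AtomicU64::new(0),
            vt_delta_for_get: AtomicU64::new(BASE_VT_DELTA),
            parent_vt: None,
        });
        let delta = tracker.vt_delta_for_get.load(Ordering::Relaxed);
        match &tracker.parent_vt {
            // After a hot split, all children advance the same shared
            // clock, so the group as a whole stays deprioritized.
            Some(shared) => shared.fetch_add(delta, Ordering::Relaxed),
            None => tracker.virtual_time.fetch_add(delta, Ordering::Relaxed),
        }
    }

    fn on_region_split(&self, parent_id: u64, child1_id: u64, child2_id: u64) {
        let parent_vt = self
            .region_vts
            .get(&parent_id)
            .map(|t| t.virtual_time.load(Ordering::Relaxed))
            .unwrap_or(0);
        let overloaded = self.cpu_utilization.load(Ordering::Relaxed) > 80;
        // Under high CPU, both children point at one shared clock seeded
        // with the parent's VT; otherwise `shared` stays None and each
        // child starts from the parent's VT independently.
        let shared = overloaded.then(|| Arc::new(AtomicU64::new(parent_vt)));
        for child in [child1_id, child2_id] {
            self.region_vts.insert(child, RegionVtTracker {
                virtual_time: AtomicU64::new(parent_vt),
                vt_delta_for_get: AtomicU64::new(BASE_VT_DELTA),
                parent_vt: shared.clone(),
            });
        }
    }

    fn check_and_transition_to_independent(&self) {
        if self.cpu_utilization.load(Ordering::Relaxed) < 80 {
            for mut entry in self.region_vts.iter_mut() {
                if let Some(shared) = entry.parent_vt.take() {
                    // Snapshot the shared clock, then continue independently.
                    entry
                        .virtual_time
                        .store(shared.load(Ordering::Relaxed), Ordering::Relaxed);
                }
            }
        }
    }
}
```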
Add a `region_id` field to `TaskMetadata`:

```rust
const REGION_ID_MASK: u8 = 0b0000_0100;

impl TaskMetadata {
    fn region_id(&self) -> u64 {
        // Extract from the metadata bytes
    }
}
```

Update `ResourceController` to include the region VT:
```rust
impl TaskPriorityProvider for ResourceController {
    fn priority_of(&self, extras: &Extras) -> TaskPriority {
        let metadata = TaskMetadata::from(extras.metadata());
        // 1. Get the group VT
        let group_vt = self
            .resource_group(metadata.group_name())
            .get_group_vt(level, override_priority);
        // 2. Get the region VT
        let region_id = metadata.region_id();
        let region_vt = self.region_tracker.get_and_increment_vt(region_id);
        TaskPriority { group_priority, group_vt, region_vt }
    }
}
```

Wire region tracking into the execution paths:
```rust
// After a task completes:
region_tracker.consume(
    region_id,
    cpu_time,
    keys_scanned,
    bytes_read,
);
```

Periodic normalization and delta updates:
```rust
// Run every 1 second
fn periodic_region_maintenance(region_tracker: &RegionResourceTracker) {
    region_tracker.normalize_region_vts();
    region_tracker.update_vt_deltas();
}
```

New configuration option:

```toml
[resource-control]
enable-region-tracking = true
```
- Temporary traffic moderation: VT-based traffic moderation is temporary. It does not persist if a node is rebooted after regions are split.
- Shared region fairness issues: When multiple resource groups access the same region:
  - Innocent tenant penalized: Tenant A's heavy usage increases the region VT, penalizing Tenant B's requests.
  - Hot region stays hot: If tenants alternate requests, each tenant's `group_vt` stays low, so the region never gets properly deprioritized.

  Mitigation: Ensure resource groups don't share tables. Regions are generally created at table boundaries when tables are large enough.