
Commit 30e774a

Merge 9d4023f into blathers/backport-staging-v24.3.13-145919
2 parents 538b9b7 + 9d4023f commit 30e774a

2 files changed, +47 -13 lines changed

pkg/util/admission/admissionpb/io_threshold.go

+12-9
@@ -14,21 +14,24 @@ import (
 
 // Score returns, as the second return value, whether IO admission control is
 // considering the Store overloaded wrt compaction of L0. The first return
-// value is a 1-normalized float (i.e. 1.0 is the threshold at which the
-// second value flips to true).
+// value is a 1-normalized float, where 1.0 represents severe overload, and
+// therefore 1.0 is the threshold at which the second value flips to true.
+// Admission control currently tries to maintain a store around a score
+// threshold of 0.5 for regular work and lower than 0.25 for elastic work. NB:
+// this is an incomplete representation of the signals considered by admission
+// control -- admission control additionally considers disk and flush
+// throughput bottlenecks.
 //
 // The zero value returns (0, false). Use of the nil pointer is not allowed.
 //
-// TODO(sumeer): consider whether we need to enhance this to incorporate
-// overloading via flush bandwidth. I suspect we can get away without
-// incorporating flush bandwidth since typically chronic overload will be due
-// to compactions falling behind (though that may change if we increase the
-// max number of compactions). And we will need to incorporate overload due to
-// disk bandwidth bottleneck.
-//
 // NOTE: Future updates to the scoring function should be version gated as the
 // threshold is gossiped and used to determine lease/replica placement via the
 // allocator.
+//
+// IOThreshold has various parameters that can evolve over time. The source of
+// truth for an IOThreshold struct is admission.ioLoadListener, and is
+// propagated elsewhere using the admission.IOThresholdConsumer interface. No
+// other production code should create one from scratch.
 func (iot *IOThreshold) Score() (float64, bool) {
 	// iot.L0NumFilesThreshold and iot.L0NumSubLevelsThreshold are initialized to
 	// 0 by default, and there appears to be a period of time before we update
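For illustration, a minimal Go sketch of the scoring semantics that the rewritten comment describes: a score that is 1-normalized against the overload thresholds, flips the boolean at 1.0, and sits near 0.5 when regular work holds sub-levels at half the threshold. This is not the production implementation -- the real Score also adjusts for a small L0 in terms of bytes -- and the L0NumFiles / L0NumSubLevels field names are assumptions made only for this sketch.

package main

import "fmt"

// ioThresholdSketch is a simplified stand-in for IOThreshold, carrying only
// the counts and thresholds needed to illustrate a 1-normalized score.
type ioThresholdSketch struct {
	L0NumFiles              int64 // assumed field name, for illustration only
	L0NumFilesThreshold     int64
	L0NumSubLevels          int64 // assumed field name, for illustration only
	L0NumSubLevelsThreshold int64
}

// score returns the load relative to the overload thresholds; 1.0 is the
// point at which the second return value flips to true.
func (iot *ioThresholdSketch) score() (float64, bool) {
	if iot.L0NumFilesThreshold == 0 || iot.L0NumSubLevelsThreshold == 0 {
		// Mirrors the documented zero-value behavior: (0, false).
		return 0, false
	}
	fileScore := float64(iot.L0NumFiles) / float64(iot.L0NumFilesThreshold)
	subLevelScore := float64(iot.L0NumSubLevels) / float64(iot.L0NumSubLevelsThreshold)
	score := fileScore
	if subLevelScore > score {
		score = subLevelScore
	}
	return score, score >= 1.0
}

func main() {
	// 10 sub-levels against a threshold of 20 gives the ~0.5 score that
	// admission control targets for regular work; 25 sub-levels crosses 1.0.
	ok := ioThresholdSketch{L0NumFiles: 400, L0NumFilesThreshold: 4000,
		L0NumSubLevels: 10, L0NumSubLevelsThreshold: 20}
	hot := ioThresholdSketch{L0NumFiles: 400, L0NumFilesThreshold: 4000,
		L0NumSubLevels: 25, L0NumSubLevelsThreshold: 20}
	fmt.Println(ok.score())  // 0.5 false
	fmt.Println(hot.score()) // 1.25 true
}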

pkg/util/admission/io_load_listener.go

+35-4
@@ -119,7 +119,12 @@ var walFailoverUnlimitedTokens = settings.RegisterBoolSetting(
 	"when true, during WAL failover, unlimited admission tokens are allocated",
 	false)
 
-// Experimental observations:
+// The following experimental observations were used to guide the initial
+// implementation, which aimed to maintain a sub-level count of 20 with token
+// calculation every 60s. Since then, the code has evolved to calculate tokens
+// every 15s and to aim for regular work maintaining a sub-level count of
+// l0SubLevelCountOverloadThreshold/2. So this commentary should be
+// interpreted in that context:
 // - Sub-level count of ~40 caused a node heartbeat latency p90, p99 of 2.5s,
 //   4s. With a setting that limits sub-level count to 10, before the system
 //   is considered overloaded, and adjustmentInterval = 60, we see the actual
@@ -133,9 +138,35 @@ var walFailoverUnlimitedTokens = settings.RegisterBoolSetting(
 //   then we run the risk of having 100+ sub-levels when we hit a file count
 //   of 1000. Instead we use a sub-level overload threshold of 20.
 //
-// We've set these overload thresholds in a way that allows the system to
-// absorb short durations (say a few minutes) of heavy write load.
-const l0FileCountOverloadThreshold = 1000
+// A sub-level count of l0SubLevelCountOverloadThreshold results in the same
+// score as a file count of l0FileCountOverloadThreshold. Exceptions: a small
+// L0 in terms of bytes (see IOThreshold.Score); these constants being
+// overridden in the cluster settings
+// admission.l0_sub_level_count_overload_threshold and
+// admission.l0_file_count_overload_threshold. We ignore these exceptions in
+// the discussion here. Hence, 20 sub-levels is equivalent in score to 4000 L0
+// files, i.e., 1 sub-level is equivalent to 200 files.
+//
+// Ideally, equivalence here should match equivalence in how L0 is scored for
+// compactions. CockroachDB sets Pebble's L0CompactionThreshold to a constant
+// value of 2, which results in a compaction score of 1.0 with 1 sub-level.
+// CockroachDB does not override Pebble's L0CompactionFileThreshold, which
+// defaults to 500, so 500 files cause a compaction score of 1.0. So in
+// Pebble's compaction scoring logic, 1 sub-level is equivalent to 500 L0
+// files.
+//
+// So admission control is more sensitive to higher file count than Pebble's
+// compaction scoring. l0FileCountOverloadThreshold used to be 1000 up to
+// v24.3, and increasing it to 4000 was considered a significant enough
+// change -- increasing to 10000, to make Pebble's compaction logic and
+// admission control equivalent, was considered too risky. Note that
+// admission control tries to maintain a score of 0.5 when admitting regular
+// work, which if caused by file count represents 2000 files. With 2000
+// files, the L0 compaction score is 2000/500 = 4.0, which is significantly
+// above the compaction threshold of 1.0 (at which a level is eligible for
+// compaction). So one could argue that this inconsistency between admission
+// control and Pebble is potentially harmless.
+const l0FileCountOverloadThreshold = 4000
 const l0SubLevelCountOverloadThreshold = 20
 
 // ioLoadListener adjusts tokens in kvStoreTokenGranter for IO, specifically due to
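To make the equivalence discussed in the new comment concrete, here is a small Go sketch of the arithmetic only. It reproduces the numbers from the comment; pebbleFilesPerUnitScore is a descriptive placeholder for Pebble's default L0CompactionFileThreshold of 500, not a Pebble identifier.

package main

import "fmt"

func main() {
	// Constants from the diff above.
	const l0FileCountOverloadThreshold = 4000
	const l0SubLevelCountOverloadThreshold = 20

	// Per the comment: CockroachDB pins Pebble's L0CompactionThreshold to 2
	// (compaction score 1.0 at 1 sub-level) and leaves L0CompactionFileThreshold
	// at its default of 500 (compaction score 1.0 at 500 files).
	const pebbleFilesPerUnitScore = 500.0

	// Admission control equivalence: files per sub-level at equal score.
	acFilesPerSubLevel := float64(l0FileCountOverloadThreshold) /
		float64(l0SubLevelCountOverloadThreshold)
	fmt.Println("AC: 1 sub-level is equivalent to", acFilesPerSubLevel, "files") // 200

	// At the score of 0.5 that admission control targets for regular work, a
	// purely file-count-driven signal corresponds to this many files.
	filesAtHalfScore := 0.5 * l0FileCountOverloadThreshold
	fmt.Println("files at AC score 0.5:", filesAtHalfScore) // 2000

	// Pebble's compaction score for that many files.
	pebbleScore := filesAtHalfScore / pebbleFilesPerUnitScore
	fmt.Println("Pebble compaction score at 2000 files:", pebbleScore) // 4
}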
