fix: limit number of concurrent optimized compactions #26319


Merged
23 commits merged into master-1.x on May 6, 2025

Conversation

@gwossum (Member) commented Apr 23, 2025

Limit number of concurrent optimized compactions so that level compactions do not get starved. Starved level compactions result in a sudden increase in disk usage.

Add [data] max-concurrent-optimized-compactions for configuring the maximum number of concurrent optimized compactions. The default value is 1.

Closes: #26315

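For a feel of the mechanism, here is a minimal, self-contained sketch of the idea: a counting semaphore caps how many optimized compactions run at once, so level compactions keep getting scheduled. All identifiers below are illustrative and not the PR's actual code; in the real configuration the cap is set via max-concurrent-optimized-compactions under [data].

package main

import (
	"fmt"
	"sync"
	"time"
)

// Hypothetical stand-in for the [data] max-concurrent-optimized-compactions setting.
const maxConcurrentOptimizedCompactions = 1

func main() {
	// Buffered channel used as a counting semaphore for optimized compactions.
	sem := make(chan struct{}, maxConcurrentOptimizedCompactions)
	var wg sync.WaitGroup

	runOptimized := func(shard int) {
		defer wg.Done()
		select {
		case sem <- struct{}{}: // slot acquired: run the optimized compaction
			defer func() { <-sem }()
			fmt.Printf("shard %d: optimized compaction running\n", shard)
			time.Sleep(10 * time.Millisecond) // stand-in for the real work
		default: // no free slot: defer it so level compactions are not starved
			fmt.Printf("shard %d: optimized compaction deferred to a later cycle\n", shard)
		}
	}

	for shard := 1; shard <= 3; shard++ {
		wg.Add(1)
		go runOptimized(shard)
	}
	wg.Wait()
}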
@davidby-influx (Contributor) commented

@gwossum & @devanbenz - I have broken out the compaction planning into a method separate from the asynchronous goroutine with a ticker that drives it. This let me write a test that I think we should massively expand to test and explore the planner.

One thing I think is a bug: when planning optimization, we skip files that are in use while counting generations. This means that in my second new test case the generation count is zero, because the files were acquired in the initial level planning. But I think the generation count in this case should be one. What do you think?

Please expand the test cases in TestEnginePlanCompactions and make sure that Engine.PlanCompactions is doing the right thing in all the edge cases.
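As a starting point for that expansion, here is a rough, self-contained sketch of a table-driven layout; the plan function signature, file names, and expected counts are placeholders, since the real Engine.PlanCompactions inputs and outputs live in the PR itself.

package tsm1_test

import "testing"

func TestEnginePlanCompactions_EdgeCases(t *testing.T) {
	// planFn stands in for Engine.PlanCompactions; swap in the real call.
	planFn := func(filesInUse map[string]bool, files []string) (level4, level5 [][]string) {
		return nil, nil
	}

	cases := []struct {
		name       string
		files      []string
		filesInUse map[string]bool
		wantLevel4 int // expected number of level 4 groups
		wantLevel5 int // expected number of level 5 (optimized) groups
	}{
		{name: "empty shard"},
		{
			name:       "files already acquired by initial level planning",
			files:      []string{"01-04.tsm"},
			filesInUse: map[string]bool{"01-04.tsm": true},
		},
	}

	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			l4, l5 := planFn(tc.filesInUse, tc.files)
			if len(l4) != tc.wantLevel4 || len(l5) != tc.wantLevel5 {
				t.Fatalf("got %d level4 / %d level5 groups, want %d / %d",
					len(l4), len(l5), tc.wantLevel4, tc.wantLevel5)
			}
		})
	}
}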

@devanbenz left a comment

Just two comments regarding the level5Aggressive variable and the logic. As you point out here: https://github.com/influxdata/influxdb/pull/26319/files#r2061085298 it's likely that we have a bug.

if level5Aggressive {
	log.Info("Planning aggressive optimized compaction because all level 5 is planned for aggressive")
	aggressive = true
} else if isOpt, filename, heur := e.IsGroupOptimized(theGroup); isOpt {


We could probably pull this out of here and put it in the PlanCompactions method; we check IsGroupOptimized there and then again here.

if isOpt, filename, heur := e.IsGroupOptimized(group); isOpt {
	e.logger.Info("Promoting full compaction level 4 group to optimized level 5 compaction group because it contains an already optimized TSM file",
		zap.String("optimized_file", filename), zap.String("heuristic", heur), zap.Strings("files", group))
	level5Groups = append(level5Groups, group)
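A self-contained sketch of that consolidation, with stand-in types and names rather than the PR's actual Engine method: the already-optimized check runs exactly once while the groups are partitioned, and the result is carried forward instead of being recomputed during scheduling.

package main

import "fmt"

type group []string

// isGroupOptimized is a stand-in for Engine.IsGroupOptimized: report whether a
// group already contains a TSM file written at the aggressive points-per-block.
func isGroupOptimized(g group) (bool, string) {
	for _, f := range g {
		if f == "05-06.tsm" { // pretend this file is already optimized
			return true, f
		}
	}
	return false, ""
}

// partition checks each level 4 group once, promotes already-optimized groups
// to level 5, and reports whether any promotion forces aggressive planning.
func partition(level4 []group) (l4, l5 []group, aggressive bool) {
	for _, g := range level4 {
		if opt, file := isGroupOptimized(g); opt {
			fmt.Printf("promoting group containing %s to level 5\n", file)
			l5 = append(l5, g)
			aggressive = true
			continue
		}
		l4 = append(l4, g)
	}
	return l4, l5, aggressive
}

func main() {
	l4, l5, aggressive := partition([]group{
		{"01-04.tsm", "02-04.tsm"},
		{"05-06.tsm"},
	})
	fmt.Println(len(l4), len(l5), aggressive) // prints: 1 1 true
}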
@devanbenz commented Apr 28, 2025

Should we Release the group that we moved from the initial level4Groups to level5Groups here, so it is no longer in filesInUse by the compaction planner and can be picked up by the PlanOptimize planner?

Actually... if we move the compaction group from level4Groups to level5Groups, it's still in use.

Contributor

That means we should set FileInUse to false when counting generations.

If we already plan a level4Group and it has a TSM file that is at aggressivePointsPerBlock, move it into a level5Group and set level5Aggressive to true. This shard should be compacted using the aggressive points per block amount so we don't need to re-write TSM files at the lower points per block.
@devanbenz commented

I would really like to find some ways to pull out the actual "scheduling" of compactions too and test some more of this large chunk of code:

https://github.com/influxdata/influxdb/blob/gw/26315/opt_compact_limiter/tsdb/engine/tsm1/engine.go#L2208-L2273

I like how we have modularized planning and made it very easy to test. It would be nice to break some of this out and make it more testable. Going to start brainstorming on this.

@devanbenz marked this pull request as ready for review May 5, 2025 16:39
@davidby-influx (Contributor) left a comment

Two questions, still need to review test coverage.

devanbenz added 3 commits May 5, 2025 14:01
a level5 compaction group that has 1 or more TSM files that are over max points per block. This will ensure we do not unwind any TSM files already over max points per block.
@davidby-influx (Contributor) left a comment

Comments, some outdated.

@@ -2704,6 +2704,9 @@ func TestDefaultPlanner_PlanOptimize_Test(t *testing.T) {

	cp := tsm1.NewDefaultPlanner(ffs, tsdb.DefaultCompactFullWriteColdDuration)

	compacted, _ := cp.FullyCompacted()
Contributor

Check string return, too. Should be empty.

@@ -2096,11 +2118,59 @@ func (e *Engine) ShouldCompactCache(t time.Time) bool {
	return t.Sub(e.Cache.LastWriteTime()) > e.CacheFlushWriteColdDuration
}

// isSingleGeneration returns true if a group contains files from a single generation.
func (e *Engine) isSingleGeneration(group CompactionGroup) bool {
Contributor

Is this used anywhere?

davidby-influx previously approved these changes May 5, 2025
@davidby-influx (Contributor) left a comment

LGTM. Please wait for @gwossum's approval as well before merging.

const waitForOptimization = time.Hour
const tickPeriod = time.Second

var ticksBeforeOptimize = int(waitForOptimization.Seconds())
@gwossum (Member, PR author) commented

ticksBeforeOptimize is only correct if tickPeriod is time.Second.

I think this will generalize it if tickPeriod is not time.Second:

var ticksBeforeOptimize = int(waitForOptimization.Seconds() * float64(time.Second) / float64(tickPeriod))
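To make the difference concrete with illustrative numbers (not taken from the PR): with tickPeriod = 250 * time.Millisecond, the original expression still yields 3600 ticks, which works out to only 15 minutes of waiting, while the generalized expression yields 14400 ticks and preserves the intended one-hour wait.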

Comment on lines 2179 to 2182
if cycleCount <= ticksBeforeOptimize {
	return true, waitMessage
} else {
	return false, ""
@gwossum (Member, PR author) commented

It might be better to do a startTime := time.Now() before the loop and then do if time.Since(startTime) > waitForOptimization here, because this loop may not take exactly one tickPeriod to run. Also, it seems easier since you don't have to do dimensional analysis to determine whether the ticksBeforeOptimize calculation is correct.

This loop runs indefinitely for the life of the program, afaik. Wouldn't startTime just always be program start? So after the first time it is greater than waitForOptimization it will always be greater?

Contributor

Yes. This just avoids optimizing compactions for the first hour after start.
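For illustration, a minimal, self-contained sketch of the wall-clock gate discussed in this thread (constants are scaled down for the demo and the loop body is hypothetical, not the PR's code):

package main

import (
	"fmt"
	"time"
)

// Scaled-down stand-ins for the PR's one-hour waitForOptimization and one-second tickPeriod.
const (
	waitForOptimization = 3 * time.Second
	tickPeriod          = time.Second
)

func main() {
	startTime := time.Now() // captured once, before the ticker loop starts
	ticker := time.NewTicker(tickPeriod)
	defer ticker.Stop()

	for range ticker.C {
		if time.Since(startTime) <= waitForOptimization {
			// Still inside the warm-up window after process start: skip
			// optimized compactions and let level compactions proceed.
			fmt.Println("skipping optimized compactions during warm-up")
			continue
		}
		// Once the window has elapsed it stays elapsed, so optimized
		// compactions are eligible on every subsequent tick.
		fmt.Println("optimized compactions eligible")
		return // the real loop runs for the life of the engine
	}
}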

@devanbenz left a comment

LGTM

@devanbenz merged commit 66f4dbe into master-1.x May 6, 2025
9 checks passed
@devanbenz deleted the gw/26315/opt_compact_limiter branch May 6, 2025 20:42
devanbenz added a commit that referenced this pull request May 6, 2025
Limit number of concurrent optimized compactions so that level compactions do not get starved. Starved level compactions result in a sudden increase in disk usage.

Add [data] max-concurrent-optimized-compactions for configuring maximum number of concurrent optimized compactions. Default value is 1.

Co-authored-by: davidby-influx <[email protected]>
Co-authored-by: devanbenz <[email protected]>
Closes: #26315
(cherry picked from commit 66f4dbe)
devanbenz added a commit that referenced this pull request May 6, 2025
Co-authored-by: Geoffrey Wossum <[email protected]>
Co-authored-by: davidby-influx <[email protected]>
Closes: #26315