Scope replication-failure read-only to repo/project, not the whole server#1305
Scope replication-failure read-only to repo/project, not the whole server#1305ikhoon wants to merge 9 commits into
Conversation
…rver
Motivation:
Central Dogma is a multi-tenant store: a single replica hosts many
projects and repositories belonging to unrelated tenants. When a
follower fails to replay a single log (state divergence, lock
acquisition timeout, or transient I/O error), the existing failure
path puts the whole server into read-only mode. A problem in one
repository then propagates to every other tenant on that replica:
writes stop everywhere until an operator investigates and lifts the
server-wide read-only state.
The goal of this change is to shrink that blast radius. A replay
failure flips only the smallest enclosing scope (repository, project,
or server) to read-only, and the executor keeps replicating unrelated
logs for everyone else. The read-only state is persisted so it
survives restarts — in `dogma/dogma` for repo/project scope, and in
a properties file for server scope. The existing administrative API
can lift it once the operator has investigated.
Modifications:
- Add `UpdateProjectStatusCommand` and `UpdateRepositoryStatusCommand`,
routed through the same replication pipeline as ordinary commands so
scope changes converge across the cluster. Their effect lands in
`dogma/dogma/status/{project}/{repo}.json` — the same storage as
ordinary commands, intentionally: a replay failure of the status
update itself is then treated as a `dogma/dogma` failure and
escalates to server scope.
- Carry `commandType`, `projectName`, and `repoName` on every
`LogMeta`.
- Add `ReplicationLogContext` attached to `ReplicationException` so
`ZooKeeperCommandExecutor.handleReplicationFailure()` can pick the
read-only scope:
- null project name (root commands or `dogma/dogma`) → server
- non-null project name with null repo (or the dogma metadata
repo) → project
- otherwise → repository
- Add `RepoStatusManager` to manage the read-only status of projects
and repositories; remove `MetadataService.updateRepositoryStatus()`
and migrate callers.
- On a replay error, skip past the failing log (so subsequent
unrelated logs keep replaying) and issue the scope-appropriate
`Update(Project|Repository)StatusCommand`. Fall back to a
server-wide read-only state only if recovery itself fails.
- Wire `StandaloneCommandExecutor` to consult `RepoStatusManager` on
every push, so a per-repo / per-project read-only state rejects
writes with `ReadOnlyException` before reaching the storage layer.
- Rename `DefaultCrud*` → `ReplicatingCrud*` and introduce
`StandaloneCrud*` to bypass replication when the standalone executor
applies state updates directly.
- Move `ServerStatusManager` to the internal package.
- Add `ReplicationStatus` to describe the status of projects and
repositories, and migrate `RepositoryStatus` in `RepositoryDto` to
`ReplicationStatus`.
Result:
- A replication failure no longer puts the whole server into read-only
mode; only the affected project or repository becomes read-only.
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds ReplicationStatus and RepoStatusManager; replaces git-scoped CRUD with replicating/standalone CRUD; adds update-project/repository status commands; wires repo-status checks through executors and HTTP API; refactors ZooKeeperCommandExecutor to carry replication log context and apply scoped read-only failure handling. ChangesRepo status and DTOs
Repo status manager & models
CRUD abstractions and implementations
Commands and executor wiring
HTTP API and variable service
Replication log & ZooKeeper replay
Tests and other small changes
Sequence Diagram (high level) sequenceDiagram
participant Client
participant RepositoryServiceV1
participant RepoStatusManager
participant CommandExecutor
participant ZooKeeper
Client->>RepositoryServiceV1: API call (create/patch/update status)
RepositoryServiceV1->>RepoStatusManager: query/update ReplicationStatus
RepositoryServiceV1->>CommandExecutor: execute status update command
CommandExecutor->>ZooKeeper: store replication log (LogMeta + bytes)
ZooKeeper->>CommandExecutor: deliver logs for replay (loadLog -> ReplicationLogContext)
CommandExecutor->>RepoStatusManager: handleReplicationFailure -> emit scoped Update*Status
Estimated code review effort 🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Suggested reviewers
✨ Finishing Touches🧪 Generate unit tests (beta)
|
There was a problem hiding this comment.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
server/src/main/java/com/linecorp/centraldogma/server/CentralDogma.java (1)
644-678:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winMove
repoStatusManager.initialize()beforeexecutor.start()to load persisted repository status before accepting commands.The
RepoStatusManagerconstructor initializes an emptystatusMap. Theinitialize()method only registers a listener but does not synchronously load existing persisted status files from disk.In the replicating path (lines 651–676), the executor starts at line 653 with an unpopulated cache. Any write commands processed between lines 653–671 will invoke
repoStatusManager.isWritable()(StandaloneCommandExecutor line 138) against an empty cache, potentially bypassing scoped read-only restrictions that were persisted before restart.The listener pattern is reactive and only fires on future changes, not on registration. To ensure all persisted read-only states are loaded before the executor begins accepting commands, initialize the repo status manager before starting the executor.
The non-replicating path (lines 646–650) is safe since
initialize()runs before the executor is returned; apply the same ordering to the replicating path (lines 651–676).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/src/main/java/com/linecorp/centraldogma/server/CentralDogma.java` around lines 644 - 678, The repoStatusManager.initialize() must be called before starting the executor so persisted repo statuses are loaded into the cache prior to accepting commands; move the call to repoStatusManager.initialize() to run before invoking executor.start() in the replicating branch (the block that currently calls executor.start(), waits on startFuture, then calls projectInitializer.initialize() and repoStatusManager.initialize()), ensuring the order is repoStatusManager.initialize(), projectInitializer.initialize() (as appropriate), then start the executor (executor.start()) and proceed with the existing start/wait/stop logic and finally setMirrorAccessControllerRepository(pm, executor).
🧹 Nitpick comments (1)
server/src/main/java/com/linecorp/centraldogma/server/CentralDogma.java (1)
645-678: 💤 Low valueOptional: deduplicate
repoStatusManager.initialize()andsetMirrorAccessControllerRepository(...)calls.
repoStatusManager.initialize()is invoked in all three branches (lines 647, 671, 675), andsetMirrorAccessControllerRepository(pm, executor)runs in both the non-replicating return path (line 648) and after the try/catch (line 677). Consolidating into a single tail block reduces the chance of one path forgetting to run them when this method evolves.♻️ Sketch
final ServerStatus initialServerStatus = statusManager.serverStatus(); executor.setWritable(initialServerStatus.writable()); - if (!initialServerStatus.replicating()) { - projectInitializer.initializeInReadOnlyMode(); - repoStatusManager.initialize(); - setMirrorAccessControllerRepository(pm, executor); - return executor; - } - try { - final CompletableFuture<Void> startFuture = executor.start(); - while (!startFuture.isDone()) { - if (numPendingStopRequests.get() > 0) { - // Stop request has been issued. - executor.stop().get(); - break; - } - try { - startFuture.get(100, TimeUnit.MILLISECONDS); - } catch (TimeoutException unused) { - // Taking long time .. + if (!initialServerStatus.replicating()) { + projectInitializer.initializeInReadOnlyMode(); + } else { + try { + final CompletableFuture<Void> startFuture = executor.start(); + while (!startFuture.isDone()) { + if (numPendingStopRequests.get() > 0) { + executor.stop().get(); + break; + } + try { + startFuture.get(100, TimeUnit.MILLISECONDS); + } catch (TimeoutException unused) { + // Taking long time .. + } } + startFuture.get(); + projectInitializer.initialize(); + } catch (Exception e) { + logger.warn("Failed to start the command executor. Entering read-only.", e); + projectInitializer.initializeInReadOnlyMode(); } - // Trigger the exception if any. - startFuture.get(); - projectInitializer.initialize(); - repoStatusManager.initialize(); - } catch (Exception e) { - logger.warn("Failed to start the command executor. Entering read-only.", e); - projectInitializer.initializeInReadOnlyMode(); - repoStatusManager.initialize(); } + repoStatusManager.initialize(); setMirrorAccessControllerRepository(pm, executor); return executor;(Note: only adopt this if the ordering concern raised in the other comment is resolved — i.e. it is fine to defer
repoStatusManager.initialize()until afterexecutor.start()in the non-replicating path as well.)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/src/main/java/com/linecorp/centraldogma/server/CentralDogma.java` around lines 645 - 678, Consolidate duplicated cleanup calls by removing the early invocations of repoStatusManager.initialize() and setMirrorAccessControllerRepository(pm, executor) inside the non-replicating branch and inside the try/catch, and instead call repoStatusManager.initialize() and setMirrorAccessControllerRepository(pm, executor) once at the end just before returning executor; ensure the non-replicating path no longer returns early (remove the premature return in the !initialServerStatus.replicating() branch) so that the single tail block always runs, and keep the existing executor.setWritable(initialServerStatus.writable()) and projectInitializer.initialize()/initializeInReadOnlyMode() logic intact (adjust projectInitializer.initializeInReadOnlyMode() call to remain in the non-replicating path as currently done).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@it/xds-k8s-node-ip-extractor/build.gradle`:
- Line 6: Remove the unnecessary test dependency on armeria.grpc by deleting or
replacing the line that adds testImplementation libs.armeria.grpc; tests only
use Armeria HTTP types (e.g., AggregatedHttpResponse, HttpStatus,
RequestHeaders) and a grpc-status header check so the com.linecorp.armeria.grpc
artifact is not required—ensure tests still compile using existing
armeria.junit5 or other Armeria HTTP dependencies instead of adding
armeria.grpc.
In
`@server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationLogContext.java`:
- Around line 65-81: equals() compares the byte[] field bytes by content
(Objects.deepEquals), but hashCode() uses Objects.hash(...) which uses the
byte[] identity; update hashCode in ReplicationLogContext to compute a
content-based hash for bytes (e.g., int bytesHash = Arrays.hashCode(bytes)) and
include that int in the returned hash (for example: return
Objects.hash(replayRevision, meta, log, bytesHash)); also add the necessary
import for java.util.Arrays. Ensure you modify the hashCode() method only and
keep equals() as-is.
In
`@server/src/test/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperScopedReadonlyIntegrationTest.java`:
- Around line 614-619: The test performs an unsafe cast in start(PluginContext):
ZooKeeperCommandExecutor.unwrap() is returning the delegate from Replica which
is an anonymous AbstractCommandExecutor, so casting to StandaloneCommandExecutor
can throw ClassCastException; to fix, either stop casting and store the
unwrapped executor in a compatible type (e.g., AbstractCommandExecutor or
CommandExecutor) and adjust uses in ZooKeeperScopedReadonlyIntegrationTest to
work with that API, or modify the Replica test setup so its delegate is an
actual StandaloneCommandExecutor implementation; update the start(PluginContext)
code (and/or Replica initialization) to consistently use the same executor type
rather than forcing StandaloneCommandExecutor.
In
`@testing-internal/src/main/java/com/linecorp/centraldogma/testing/internal/ProjectManagerExtension.java`:
- Line 89: The ServerStatusManager field is being re-created after the executor
is constructed, causing two different instances; move the initialization of the
serverStatusManager field (the ServerStatusManager constructor call) so it
occurs before newCommandExecutor(...) is invoked, and modify newCommandExecutor
to use the existing serverStatusManager field instead of constructing a new
ServerStatusManager instance; ensure the public serverStatusManager() getter and
any places that previously relied on a local/new instance now reference the
single shared serverStatusManager field.
- Around line 172-176: The method currently creates a local ServerStatusManager
(final ServerStatusManager serverStatusManager = new
ServerStatusManager(dataDir)) and passes it into the StandaloneCommandExecutor,
causing two distinct instances instead of reusing the field initialized earlier
in before(); replace the local instantiation with the existing field
serverStatusManager so StandaloneCommandExecutor, RepoStatusManager, and any
other consumers use the single shared ServerStatusManager instance already set
up in before().
---
Outside diff comments:
In `@server/src/main/java/com/linecorp/centraldogma/server/CentralDogma.java`:
- Around line 644-678: The repoStatusManager.initialize() must be called before
starting the executor so persisted repo statuses are loaded into the cache prior
to accepting commands; move the call to repoStatusManager.initialize() to run
before invoking executor.start() in the replicating branch (the block that
currently calls executor.start(), waits on startFuture, then calls
projectInitializer.initialize() and repoStatusManager.initialize()), ensuring
the order is repoStatusManager.initialize(), projectInitializer.initialize() (as
appropriate), then start the executor (executor.start()) and proceed with the
existing start/wait/stop logic and finally
setMirrorAccessControllerRepository(pm, executor).
---
Nitpick comments:
In `@server/src/main/java/com/linecorp/centraldogma/server/CentralDogma.java`:
- Around line 645-678: Consolidate duplicated cleanup calls by removing the
early invocations of repoStatusManager.initialize() and
setMirrorAccessControllerRepository(pm, executor) inside the non-replicating
branch and inside the try/catch, and instead call repoStatusManager.initialize()
and setMirrorAccessControllerRepository(pm, executor) once at the end just
before returning executor; ensure the non-replicating path no longer returns
early (remove the premature return in the !initialServerStatus.replicating()
branch) so that the single tail block always runs, and keep the existing
executor.setWritable(initialServerStatus.writable()) and
projectInitializer.initialize()/initializeInReadOnlyMode() logic intact (adjust
projectInitializer.initializeInReadOnlyMode() call to remain in the
non-replicating path as currently done).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b84f0604-e96f-4cc2-ad7c-e9becac91db8
📒 Files selected for processing (52)
.gitignorecommon/src/main/java/com/linecorp/centraldogma/common/ReplicationStatus.javacommon/src/main/java/com/linecorp/centraldogma/common/RepositoryStatus.javacommon/src/main/java/com/linecorp/centraldogma/common/jsonpatch/JsonPatchOperation.javacommon/src/main/java/com/linecorp/centraldogma/internal/api/v1/RepositoryDto.javait/xds-k8s-node-ip-extractor/build.gradleserver/src/main/java/com/linecorp/centraldogma/server/CentralDogma.javaserver/src/main/java/com/linecorp/centraldogma/server/command/Command.javaserver/src/main/java/com/linecorp/centraldogma/server/command/CommandExecutorStatusManager.javaserver/src/main/java/com/linecorp/centraldogma/server/command/CommandType.javaserver/src/main/java/com/linecorp/centraldogma/server/command/ProjectCommand.javaserver/src/main/java/com/linecorp/centraldogma/server/command/StandaloneCommandExecutor.javaserver/src/main/java/com/linecorp/centraldogma/server/command/UpdateProjectStatusCommand.javaserver/src/main/java/com/linecorp/centraldogma/server/command/UpdateRepositoryStatusCommand.javaserver/src/main/java/com/linecorp/centraldogma/server/internal/api/DtoConverter.javaserver/src/main/java/com/linecorp/centraldogma/server/internal/api/RepositoryServiceV1.javaserver/src/main/java/com/linecorp/centraldogma/server/internal/api/UpdateRepositoryStatusRequest.javaserver/src/main/java/com/linecorp/centraldogma/server/internal/api/sysadmin/MirrorAccessControlRequest.javaserver/src/main/java/com/linecorp/centraldogma/server/internal/api/sysadmin/ServerStatusService.javaserver/src/main/java/com/linecorp/centraldogma/server/internal/api/variable/Templater.javaserver/src/main/java/com/linecorp/centraldogma/server/internal/api/variable/VariableServiceV1.javaserver/src/main/java/com/linecorp/centraldogma/server/internal/management/RepoStatusManager.javaserver/src/main/java/com/linecorp/centraldogma/server/internal/management/RepositoryState.javaserver/src/main/java/com/linecorp/centraldogma/server/internal/management/ServerStatusManager.javaserver/src/main/java/com/linecorp/centraldogma/server/internal/management/package-info.javaserver/src/main/java/com/linecorp/centraldogma/server/internal/mirror/DefaultMirrorAccessController.javaserver/src/main/java/com/linecorp/centraldogma/server/internal/replication/LogMeta.javaserver/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationException.javaserver/src/main/java/com/linecorp/centraldogma/server/internal/replication/ReplicationLogContext.javaserver/src/main/java/com/linecorp/centraldogma/server/internal/replication/ScopedReadOnlyAppliedException.javaserver/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.javaserver/src/main/java/com/linecorp/centraldogma/server/internal/storage/repository/crud/CrudContext.javaserver/src/main/java/com/linecorp/centraldogma/server/internal/storage/repository/crud/CrudOperation.javaserver/src/main/java/com/linecorp/centraldogma/server/internal/storage/repository/crud/CrudRepository.javaserver/src/main/java/com/linecorp/centraldogma/server/internal/storage/repository/crud/ReplicatingCrudOperation.javaserver/src/main/java/com/linecorp/centraldogma/server/internal/storage/repository/crud/ReplicatingCrudRepository.javaserver/src/main/java/com/linecorp/centraldogma/server/internal/storage/repository/crud/StandaloneCrudOperation.javaserver/src/main/java/com/linecorp/centraldogma/server/internal/storage/repository/crud/StandaloneCrudRepository.javaserver/src/main/java/com/linecorp/centraldogma/server/internal/storage/repository/crud/package-info.javaserver/src/main/java/com/linecorp/centraldogma/server/metadata/MetadataService.javaserver/src/main/java/com/linecorp/centraldogma/server/metadata/RepositoryMetadata.javaserver/src/test/java/com/linecorp/centraldogma/server/internal/api/RepositoryServiceV1Test.javaserver/src/test/java/com/linecorp/centraldogma/server/internal/api/auth/RequiresRoleTest.javaserver/src/test/java/com/linecorp/centraldogma/server/internal/management/RepoStatusManagerTest.javaserver/src/test/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutorTest.javaserver/src/test/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperScopedReadOnlyTest.javaserver/src/test/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperScopedReadonlyIntegrationTest.javaserver/src/test/java/com/linecorp/centraldogma/server/internal/storage/repository/git/ReplicatingCrudRepositoryTest.javaserver/src/test/java/com/linecorp/centraldogma/server/management/ServerStatusManagerTest.javaserver/src/test/java/com/linecorp/centraldogma/server/metadata/MetadataServiceTest.javatesting-internal/src/main/java/com/linecorp/centraldogma/testing/internal/CrudRepositoryExtension.javatesting-internal/src/main/java/com/linecorp/centraldogma/testing/internal/ProjectManagerExtension.java
💤 Files with no reviewable changes (2)
- server/src/test/java/com/linecorp/centraldogma/server/metadata/MetadataServiceTest.java
- server/src/main/java/com/linecorp/centraldogma/server/metadata/MetadataService.java
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.java (1)
896-947:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winReplace
assertwith a hard null-check or makeloadLognon-nullable.
loadLog(...)is declared@Nullable(line 1206), so static analyzers will rightfully assumelogContexthere can be null, yetlogContext.log()on line 903 would NPE if it ever were. Theasserton line 904 is a no-op in production (assertions are off by default), so it provides no real protection. Either drop@NullablefromloadLog(its body either returns a non-null context or throws) or guard with an explicitrequireNonNull/if (logContext == null) throw ...here. See the related comment onloadLog.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.java` around lines 896 - 947, The assert on logContext in ZooKeeperCommandExecutor is ineffective in production because loadLog(...) is annotated `@Nullable`; replace the assert with an explicit null-check or make loadLog non-nullable: after calling ReplicationLogContext logContext = loadLog(nextRevision) validate logContext (e.g., Objects.requireNonNull(logContext, "loadLog returned null for revision " + nextRevision) or if (logContext == null) throw new ReplicationException("missing log for revision " + nextRevision, logContext)); ensure you reference the same ReplicationLogContext variable and throw a ReplicationException (or similar) so subsequent code using logContext.log() / l.command() is safe.
🧹 Nitpick comments (2)
server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.java (2)
1228-1234: 💤 Low valueRedundant
setBytescall before the decompression branch.
logContext.setBytes(bytes)on line 1228 stores the still-compressed bytes, and line 1234 unconditionally overwrites it withbytesagain (decompressed whencompressed == TRUE, identical otherwise). The first call is dead — either remove it or move the assignment to occur only once, after the conditional decompression.♻️ Proposed fix
assert logMeta.size() == offset; - logContext.setBytes(bytes); final Boolean compressed = logMeta.compressed(); if (Boolean.TRUE.equals(compressed)) { bytes = Zstd.decompress(bytes); } logContext.setBytes(bytes);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.java` around lines 1228 - 1234, The call logContext.setBytes(bytes) before checking compressed is redundant because bytes is overwritten after potential decompression; in ZooKeeperCommandExecutor adjust the logic to set logContext bytes only once after handling compression: remove the early logContext.setBytes(bytes) (or move that assignment to after the if (Boolean.TRUE.equals(logMeta.compressed())) { bytes = Zstd.decompress(bytes); }) so that logContext.setBytes(bytes) is invoked only once with the final byte array; reference variables: logContext, logMeta, compressed, and Zstd.decompress.
1206-1240: ⚡ Quick win
@Nullableis inconsistent with the implementation.
loadLognever returnsnull: it either returns the populatedlogContext(line 1237) or throws aReplicationException(line 1239). The@Nullableannotation misleads callers (e.g.replayLogsat lines 902-904, which uses anassertrather than a real null check) and any static-analysis tooling. Drop the annotation, which also lets you remove theassert l != nullat line 904.♻️ Proposed fix
- `@Nullable` `@VisibleForTesting` ReplicationLogContext loadLog(long revision) {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.java` around lines 1206 - 1240, The method loadLog currently never returns null (it always returns a ReplicationLogContext or throws), so remove the misleading `@Nullable` annotation on loadLog (and any related import), update callers like replayLogs which currently do an assert l != null to rely on the non-null return (remove that assert and any defensive null-branches), and adjust any static-analysis or javadoc if present to reflect that loadLog, which returns ReplicationLogContext and populates logContext, is non-null and will throw ReplicationException on failure.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.java`:
- Around line 1180-1191: The current directExecution branch calls
delegate.execute(ExecutionContext.empty(), command).get() while holding the
per-path InterProcessMutex, which can block indefinitely; change this to use a
bounded wait (e.g., delegate.execute(...).get(timeout, unit)) and handle
TimeoutException by logging the timeout (include scope and command details),
calling stopLater(), and avoiding storing the ReplicationLog; ensure other
exceptions are still caught by the existing catch block so the existing
logger.error and stopLater() behavior remains for failures.
- Around line 1178-1179: The logger.error message in ZooKeeperCommandExecutor
currently concatenates two string literals without a separator and contains a
typo — change the message passed to logger.error (the string around the
logger.error(...) call) to include a separator (e.g., a space or comma/newline)
between sentences and correct "attemping" to "attempting", preserving the SLF4J
placeholders and the trailing Throwable argument (i.e., the message should read
something like "Failed to replicate a log; attempting to switch to read-only
mode. scope: {}, logContext: {}" and leave the exception as the final
parameter).
- Around line 1145-1153: In handleReplicationFailure in class
ZooKeeperCommandExecutor, update the logger.error call that currently logs "Fail
to replicate a log; entering read-only mode." to use the correct past-tense
wording "Failed to replicate a log; entering read-only mode." so it matches
other messages; locate the logger.error(...) invocation inside the
handleReplicationFailure(ReplicationException exception, boolean
directExecution) method and change the message string accordingly.
---
Outside diff comments:
In
`@server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.java`:
- Around line 896-947: The assert on logContext in ZooKeeperCommandExecutor is
ineffective in production because loadLog(...) is annotated `@Nullable`; replace
the assert with an explicit null-check or make loadLog non-nullable: after
calling ReplicationLogContext logContext = loadLog(nextRevision) validate
logContext (e.g., Objects.requireNonNull(logContext, "loadLog returned null for
revision " + nextRevision) or if (logContext == null) throw new
ReplicationException("missing log for revision " + nextRevision, logContext));
ensure you reference the same ReplicationLogContext variable and throw a
ReplicationException (or similar) so subsequent code using logContext.log() /
l.command() is safe.
---
Nitpick comments:
In
`@server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.java`:
- Around line 1228-1234: The call logContext.setBytes(bytes) before checking
compressed is redundant because bytes is overwritten after potential
decompression; in ZooKeeperCommandExecutor adjust the logic to set logContext
bytes only once after handling compression: remove the early
logContext.setBytes(bytes) (or move that assignment to after the if
(Boolean.TRUE.equals(logMeta.compressed())) { bytes = Zstd.decompress(bytes); })
so that logContext.setBytes(bytes) is invoked only once with the final byte
array; reference variables: logContext, logMeta, compressed, and
Zstd.decompress.
- Around line 1206-1240: The method loadLog currently never returns null (it
always returns a ReplicationLogContext or throws), so remove the misleading
`@Nullable` annotation on loadLog (and any related import), update callers like
replayLogs which currently do an assert l != null to rely on the non-null return
(remove that assert and any defensive null-branches), and adjust any
static-analysis or javadoc if present to reflect that loadLog, which returns
ReplicationLogContext and populates logContext, is non-null and will throw
ReplicationException on failure.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: d16fe29b-dfa2-4b60-a1d6-7ab13e9e302c
📒 Files selected for processing (2)
server/src/main/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutor.javaserver/src/test/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutorTest.java
🚧 Files skipped from review as they are similar to previous changes (1)
- server/src/test/java/com/linecorp/centraldogma/server/internal/replication/ZooKeeperCommandExecutorTest.java
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@server/src/main/java/com/linecorp/centraldogma/server/CentralDogma.java`:
- Around line 1436-1440: The leader-scoped PluginGroup instances
(pluginsForLeaderOnly and pluginsForZoneLeaderOnly) are being closed before the
command executor is stopped, which prevents their stop(...) callbacks from
reliably submitting work; move the calls that close pluginsForLeaderOnly and
pluginsForZoneLeaderOnly to after CentralDogma.this.doStop()/the command
executor shutdown so their release-leadership callbacks can submit stop tasks to
the executor and run to completion.
In `@server/src/main/java/com/linecorp/centraldogma/server/PluginGroup.java`:
- Around line 182-199: The close() method in PluginGroup currently calls
executor.shutdown() and loops on awaitTermination indefinitely, which can hang;
change it to perform a bounded graceful shutdown: call executor.shutdown(), wait
up to a configurable timeout (e.g., total wait of N seconds via repeated
executor.awaitTermination calls or a single awaitTermination), and if the
executor does not terminate within that timeout call executor.shutdownNow() to
cancel stuck tasks; preserve the InterruptedException handling (set interrupted
flag and re-interrupt thread in finally), and update the logger message to
reflect escalation (use executor.shutdownNow() result if helpful).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: cfd0a5d9-8b00-432f-95eb-532578755885
📒 Files selected for processing (2)
server/src/main/java/com/linecorp/centraldogma/server/CentralDogma.javaserver/src/main/java/com/linecorp/centraldogma/server/PluginGroup.java
| void close() { | ||
| executor.shutdown(); | ||
| boolean interrupted = false; | ||
| try { | ||
| while (!executor.isTerminated()) { | ||
| try { | ||
| executor.awaitTermination(1, TimeUnit.SECONDS); | ||
| } catch (InterruptedException ignore) { | ||
| interrupted = true; | ||
| } | ||
| } | ||
| } catch (Exception cause) { | ||
| logger.warn("Failed to shutdown the {}:", executor, cause); | ||
| } finally { | ||
| if (interrupted) { | ||
| Thread.currentThread().interrupt(); | ||
| } | ||
| } |
There was a problem hiding this comment.
Bound close() so a stuck plugin task can't hang shutdown.
shutdown() plus an unbounded awaitTermination() loop means one blocked plugin task can stall server shutdown forever here. This needs a timeout and a shutdownNow() fallback, especially now that CentralDogma calls close() on every plugin group during stop.
⏱️ Suggested shutdown escalation
void close() {
executor.shutdown();
boolean interrupted = false;
try {
- while (!executor.isTerminated()) {
- try {
- executor.awaitTermination(1, TimeUnit.SECONDS);
- } catch (InterruptedException ignore) {
- interrupted = true;
- }
+ if (!executor.awaitTermination(10, TimeUnit.SECONDS)) {
+ executor.shutdownNow();
+ if (!executor.awaitTermination(10, TimeUnit.SECONDS)) {
+ logger.warn("Timed out shutting down {}", executor);
+ }
}
+ } catch (InterruptedException cause) {
+ interrupted = true;
+ executor.shutdownNow();
} catch (Exception cause) {
logger.warn("Failed to shutdown the {}:", executor, cause);
} finally {
if (interrupted) {
Thread.currentThread().interrupt();📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| void close() { | |
| executor.shutdown(); | |
| boolean interrupted = false; | |
| try { | |
| while (!executor.isTerminated()) { | |
| try { | |
| executor.awaitTermination(1, TimeUnit.SECONDS); | |
| } catch (InterruptedException ignore) { | |
| interrupted = true; | |
| } | |
| } | |
| } catch (Exception cause) { | |
| logger.warn("Failed to shutdown the {}:", executor, cause); | |
| } finally { | |
| if (interrupted) { | |
| Thread.currentThread().interrupt(); | |
| } | |
| } | |
| void close() { | |
| executor.shutdown(); | |
| boolean interrupted = false; | |
| try { | |
| if (!executor.awaitTermination(10, TimeUnit.SECONDS)) { | |
| executor.shutdownNow(); | |
| if (!executor.awaitTermination(10, TimeUnit.SECONDS)) { | |
| logger.warn("Timed out shutting down {}", executor); | |
| } | |
| } | |
| } catch (InterruptedException cause) { | |
| interrupted = true; | |
| executor.shutdownNow(); | |
| } catch (Exception cause) { | |
| logger.warn("Failed to shutdown the {}:", executor, cause); | |
| } finally { | |
| if (interrupted) { | |
| Thread.currentThread().interrupt(); | |
| } | |
| } | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@server/src/main/java/com/linecorp/centraldogma/server/PluginGroup.java`
around lines 182 - 199, The close() method in PluginGroup currently calls
executor.shutdown() and loops on awaitTermination indefinitely, which can hang;
change it to perform a bounded graceful shutdown: call executor.shutdown(), wait
up to a configurable timeout (e.g., total wait of N seconds via repeated
executor.awaitTermination calls or a single awaitTermination), and if the
executor does not terminate within that timeout call executor.shutdownNow() to
cancel stuck tasks; preserve the InterruptedException handling (set interrupted
flag and re-interrupt thread in finally), and update the logger message to
reflect escalation (use executor.shutdownNow() result if helpful).
Motivation:
Central Dogma is a multi-tenant store: a single replica hosts many projects and repositories belonging to unrelated tenants. When a follower fails to replay a single log (state divergence, lock acquisition timeout, or transient I/O error), the existing failure path puts the whole server into read-only mode. A problem in one repository then propagates to every other tenant on that replica: writes stop everywhere until an operator investigates and lifts the server-wide read-only state.
The goal of this change is to shrink that blast radius. A replay failure flips only the smallest enclosing scope (repository, project, or server) to read-only, and the executor keeps replicating unrelated logs for everyone else. The read-only state is persisted so it survives restarts — in
dogma/dogmafor repo/project scope, and in a properties file for server scope. The existing administrative API can lift it once the operator has investigated.Modifications:
UpdateProjectStatusCommandandUpdateRepositoryStatusCommand, routed through the same replication pipeline as ordinary commands so scope changes converge across the cluster. Their effect lands indogma/dogma/status/{project}/{repo}.json— the same storage as ordinary commands, intentionally: a replay failure of the status update itself is then treated as adogma/dogmafailure and escalates to server scope.commandType,projectName, andrepoNameon everyLogMeta.ReplicationLogContextattached toReplicationExceptionsoZooKeeperCommandExecutor.handleReplicationFailure()can pick the read-only scope:dogma/dogma) → serverRepoStatusManagerto manage the read-only status of projects and repositories; removeMetadataService.updateRepositoryStatus()and migrate callers.Update(Project|Repository)StatusCommand. Fall back to a server-wide read-only state only if recovery itself fails.StandaloneCommandExecutorto consultRepoStatusManageron every push, so a per-repo / per-project read-only state rejects writes withReadOnlyExceptionbefore reaching the storage layer.DefaultCrud*→ReplicatingCrud*and introduceStandaloneCrud*to bypass replication when the standalone executor applies state updates directly.ServerStatusManagerto the internal package.ReplicationStatusto describe the status of projects and repositories, and migrateRepositoryStatusinRepositoryDtotoReplicationStatus.Result: