Lazin
Contributor

@Lazin Lazin commented Oct 13, 2025

This PR adds the plumbing required to pass the expected term through the raft::replicate_options struct instead of through a dedicated replicate or replicate_in_stages overload. It uses this field to replicate placeholder batches with a specified term.

The alternative is to keep the replicate and replicate_in_stages overloads. I don't like this because the same parameter could then be passed in two different ways (via a parameter or via the replicate_options struct).
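
For readers unfamiliar with the API, the idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual Redpanda definitions: `term_id`, `consistency_level`, the field layout, and the second constructor are hypothetical stand-ins for the real types in src/v/raft.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <optional>

// Hypothetical stand-ins; the real types live in the Redpanda source tree.
using term_id = std::int64_t;
enum class consistency_level { quorum_ack, leader_ack, no_ack };

struct replicate_options {
    explicit replicate_options(consistency_level l)
      : consistency(l) {}

    // Assumed shape of the new constructor: the caller pins the term up front.
    replicate_options(consistency_level l, term_id expected)
      : consistency(l)
      , expected_term(expected) {}

    consistency_level consistency;
    std::optional<std::chrono::milliseconds> timeout;
    // Empty means the caller has no term requirement.
    std::optional<term_id> expected_term;
};

// Small usage example: one options object carries the term,
// so no separate overload taking a term is needed.
inline void example_usage() {
    replicate_options plain(consistency_level::quorum_ack);
    assert(!plain.expected_term.has_value());

    replicate_options pinned(consistency_level::quorum_ack, term_id{5});
    assert(pinned.expected_term.has_value() && *pinned.expected_term == 5);
}
```

With this shape there is a single place to carry the term, which is the point made above about not having two ways to pass the same parameter.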

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.2.x
  • v25.1.x
  • v24.3.x

Release Notes

  • none

@Lazin Lazin requested review from Copilot and mmaslankaprv and removed request for Copilot October 13, 2025 12:30
@Lazin Lazin requested review from bharathv and rockwotj October 13, 2025 12:31
@Lazin Lazin force-pushed the ct/replicate-with-term branch from 43f096d to b2d0a43 Compare October 13, 2025 13:49
@Copilot Copilot AI review requested due to automatic review settings October 13, 2025 13:49
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR refactors the Raft replication API to consolidate term-based replication by moving the expected_term parameter from dedicated method overloads into the replicate_options struct. This eliminates the need for separate replicate and replicate_in_stages overloads that accepted a term parameter.

Key changes:

  • Added expected_term field to replicate_options struct with a new constructor
  • Removed term-based overloads from consensus methods
  • Updated all call sites to pass expected term via replicate_options instead of as a separate parameter

Reviewed Changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/v/raft/replicate.h | Added expected_term field and new constructor to replicate_options |
| src/v/raft/consensus.h | Removed term-based method overloads, updated documentation |
| src/v/raft/consensus.cc | Removed term-based overload implementations |
| src/v/raft/replicate_batcher.h | Removed expected_term parameter from method signatures |
| src/v/raft/replicate_batcher.cc | Updated to use expected_term from replicate_options |
| Multiple test and implementation files | Updated call sites to use new API pattern |

Comment on lines 1057 to 1059
if (!opts.expected_term.has_value()) {
    opts.expected_term = _insync_term;
}
Contributor

Shouldn't we do a term check here if it is specified?

Contributor Author

I'd prefer to have the check in the old place. The fewer surprises the better.

Comment on lines +1214 to +1219
if (!opts.expected_term.has_value()) {
    opts.expected_term = _insync_term;
}
Contributor

else verify term

Contributor Author

I think it's better to propagate it and let rm_stm validate it normally and maybe step down there. I want to retain all the logging that happens prior to step down.
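
The two positions in this thread can be sketched side by side. The snippet below is only an illustration under assumptions: `term_id`, `errc`, `with_default_term`, and `check_term` are hypothetical names, and the later check stands in for whatever validation the downstream code actually performs. It shows the defaulting step from the diff kept separate from a term comparison that runs later.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

using term_id = std::int64_t; // hypothetical stand-in for the real term type

struct replicate_options {
    std::optional<term_id> expected_term;
};

enum class errc { success, not_leader }; // hypothetical error codes

// Mirrors the snippet under review: fill in the last synced term
// only when the caller did not specify one. No validation happens here.
inline replicate_options
with_default_term(replicate_options opts, term_id insync_term) {
    if (!opts.expected_term.has_value()) {
        opts.expected_term = insync_term;
    }
    return opts;
}

// Assumed downstream check: a caller-supplied term that no longer
// matches the current term is rejected at the usual validation point.
inline errc check_term(const replicate_options& opts, term_id current_term) {
    if (opts.expected_term.has_value() && *opts.expected_term != current_term) {
        return errc::not_leader;
    }
    return errc::success;
}

inline void example_usage() {
    // Caller left the term empty: it is defaulted and passes the check.
    auto defaulted = with_default_term(replicate_options{}, term_id{4});
    assert(check_term(defaulted, term_id{4}) == errc::success);

    // Caller pinned a stale term: it is propagated unchanged and rejected later.
    auto stale = with_default_term(replicate_options{term_id{3}}, term_id{4});
    assert(*stale.expected_term == 3);
    assert(check_term(stale, term_id{4}) == errc::not_leader);
}
```

Keeping the comparison out of the defaulting step means a mismatch still surfaces through the existing validation path, which matches the author's stated wish to keep the logging that precedes a step down.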

  , batches(std::move(batches))
  , batch_id(batch_id)
- , opts(update_replicate_options(opts))
+ , opts(update_replicate_options(opts, this->partition->term()))
Contributor

We want the term returned in the fence. Otherwise you need to check the term here against the fence term

Contributor Author

fixed

  auto unit = co_await _state_lock.hold_read_lock();
  if (bid.is_transactional) {
-     co_return co_await transactional_replicate(bid, std::move(batch));
+     co_return co_await transactional_replicate(bid, std::move(batch), opts);
Contributor

I think there is a behavioral change here. Transactions didn't honor replicate_options.timeout prior to this patch and now they do, since opts are passed along (one can argue they should, but I think that should be out of scope for this PR).

Contributor Author

fixed that, now I'm passing the optional there so there is nothing to get this timeout from


- auto r = co_await _raft->replicate(
-   synced_term, std::move(batch), make_replicate_options());
+ auto r = co_await _raft->replicate(std::move(batch), opts);
Contributor

There is a behavioral change here: make_replicate_options() sets force_flush = true (to bypass write caching overrides). It would be better to use make_replicate_options() in all transactional code paths.

Contributor Author

fixed
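
One way to read the resolution of the last two threads is sketched below. It is an assumption-laden illustration, not the code that was merged: `make_txn_replicate_options`, `term_id`, `consistency_level`, and the struct layout are hypothetical. The transactional path receives only an optional term and builds its own options, so force_flush stays on and no caller-provided timeout leaks in.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <optional>

using term_id = std::int64_t; // hypothetical stand-in
enum class consistency_level { quorum_ack, leader_ack, no_ack };

struct replicate_options {
    consistency_level consistency = consistency_level::quorum_ack;
    bool force_flush = false;
    std::optional<std::chrono::milliseconds> timeout;
    std::optional<term_id> expected_term;
};

// Hypothetical helper for transactional paths: always forces a flush
// and takes only the optional term, never a full options object.
inline replicate_options
make_txn_replicate_options(std::optional<term_id> expected_term) {
    replicate_options opts;
    opts.force_flush = true;            // bypass write caching overrides
    opts.expected_term = expected_term; // may be empty
    return opts;
}

inline void example_usage() {
    auto opts = make_txn_replicate_options(term_id{3});
    assert(opts.force_flush);
    assert(!opts.timeout.has_value()); // nothing to inherit a timeout from
    assert(opts.expected_term.has_value() && *opts.expected_term == 3);
}
```

Passing only the optional term keeps the pre-existing transactional behavior (forced flush, no timeout) while still allowing the caller to pin a term.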

@vbotbuildovich
Collaborator

vbotbuildovich commented Oct 13, 2025

Retry command for Build#74075

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/write_caching_fi_e2e_test.py::WriteCachingFailureInjectionE2ETest.test_crash_all@{"use_transactions":true}

@vbotbuildovich
Collaborator

vbotbuildovich commented Oct 13, 2025

CI test results

test results on build#74075
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AuditLogTestKafkaApi | test_no_auth_enabled | {"audit_transport_mode": "rpc"} | integration | https://buildkite.com/redpanda/redpanda/builds/74075#0199ddf4-08fd-4c83-b49b-395afd09a3f2 | FLAKY | 19/21 | upstream reliability is '100.0'. current run reliability is '90.47619047619048'. drift is 9.52381 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestKafkaApi&test_method=test_no_auth_enabled |
| RandomNodeOperationsTest | test_node_operations | {"cloud_storage_type": 1, "compaction_mode": "adjacent_merge", "enable_failures": true, "mixed_versions": true, "with_iceberg": false} | integration | https://buildkite.com/redpanda/redpanda/builds/74075#0199ddf4-08f5-405b-b3d6-f6cee8be6166 | FLAKY | 20/21 | upstream reliability is '97.10982658959537'. current run reliability is '95.23809523809523'. drift is 1.87173 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RandomNodeOperationsTest&test_method=test_node_operations |
| WriteCachingFailureInjectionE2ETest | test_crash_all | {"use_transactions": false} | integration | https://buildkite.com/redpanda/redpanda/builds/74075#0199ddf4-08fd-4c83-b49b-395afd09a3f2 | FLAKY | 14/21 | upstream reliability is '82.62867647058823'. current run reliability is '66.66666666666666'. drift is 15.96201 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all |
| WriteCachingFailureInjectionE2ETest | test_crash_all | {"use_transactions": true} | integration | https://buildkite.com/redpanda/redpanda/builds/74075#0199ddf2-243e-429a-bf1b-721d95e47920 | FAIL | 0/21 | The test has failed across all retries | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all |
| WriteCachingFailureInjectionE2ETest | test_crash_all | {"use_transactions": true} | integration | https://buildkite.com/redpanda/redpanda/builds/74075#0199ddf4-08fe-4e40-9703-b5841285ce2d | FAIL | 0/21 | The test has failed across all retries | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all |
test results on build#74099
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ShadowLinkBasicTests | test_rapid_shadow_link_toggling | null | integration | https://buildkite.com/redpanda/redpanda/builds/74099#0199df2d-fd44-4011-8252-f68e7b111e32 | FLAKY | 20/21 | upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkBasicTests&test_method=test_rapid_shadow_link_toggling |
| SegmentMsTest | test_segment_rolling_with_retention_consumer | null | integration | https://buildkite.com/redpanda/redpanda/builds/74099#0199df2d-fd49-47c8-8832-be6e4103d2db | FLAKY | 16/21 | upstream reliability is '91.90897597977245'. current run reliability is '76.19047619047619'. drift is 15.7185 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SegmentMsTest&test_method=test_segment_rolling_with_retention_consumer |

Lazin added 2 commits October 13, 2025 15:11
This commit adds plumbing for the expected term. Previously, the
expected term was passed directly into 'raft::replicate' and
'raft::replicate_in_stages' (2x overloads of each method existed).

This commit adds expected_term to replicate_options.
The field is supposed to be used to replace the 'replicate' method
overloads that take expected term.

Now that replicate_options has an expected_term field, we no longer
need to pass this value directly. This commit removes the overloads
that accept the expected term, propagates the value all the way up to
the batcher, and updates the call sites.

Signed-off-by: Evgeny Lazin <[email protected]>

Signed-off-by: Evgeny Lazin <[email protected]>
@Lazin Lazin force-pushed the ct/replicate-with-term branch from b2d0a43 to 54966ef Compare October 13, 2025 19:11
@Lazin Lazin requested review from bharathv and rockwotj October 13, 2025 19:11
Contributor

@rockwotj rockwotj left a comment

LGTM. Will let the replication team bless this with an approval.

Member

@dotnwat dotnwat left a comment

Also LGTM. Let's wait on @bharathv / @mmaslankaprv to approve. It seems like the primary change is in the second commit? If so, and if you don't get timely approval from the replication team, then please just resubmit the first commit as a separate PR for replication team review.

Contributor

@bharathv bharathv left a comment

lgtm.

@Lazin Lazin enabled auto-merge October 13, 2025 20:59