Skip to content

Conversation

@mregrock
Copy link
Collaborator

Changelog entry

Fix flakky tests

Changelog category

  • Not for changelog (changelog entry is not required)

Description for reviewers

...

Copilot AI review requested due to automatic review settings December 24, 2025 13:44
@ydbot
Copy link
Collaborator

ydbot commented Dec 24, 2025

Run Extra Tests

Run additional tests for this PR. You can customize:

  • Test Size: small, medium, large (default: all)
  • Test Targets: any directory path (default: ydb/)
  • Sanitizers: ASAN, MSAN, TSAN
  • Coredumps: enable for debugging (default: off)
  • Additional args: custom ya make arguments

▶  Run tests

@github-actions
Copy link

github-actions bot commented Dec 24, 2025

2025-12-24 13:47:43 UTC Pre-commit check linux-x86_64-relwithdebinfo for 56a7af3 has started.
2025-12-24 13:47:49 UTC Artifacts will be uploaded here
2025-12-24 13:49:12 UTC ya make is running...
2025-12-24 13:49:58 UTC Check cancelled

@github-actions
Copy link

github-actions bot commented Dec 24, 2025

2025-12-24 13:47:44 UTC Pre-commit check linux-x86_64-release-asan for 56a7af3 has started.
2025-12-24 13:47:53 UTC Artifacts will be uploaded here
2025-12-24 13:49:13 UTC ya make is running...
2025-12-24 13:49:58 UTC Check cancelled

@github-actions
Copy link

🟢 2025-12-24 13:48:03 UTC The validation of the Pull Request description is successful.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes flaky tests by introducing a mechanism to reset the node ID counter used by the test actor runtime. The changes add a new static method to reset the global NextNodeId counter to ensure tests start with predictable node IDs.

Key Changes:

  • Added ResetFirstNodeId() static method to TTestActorRuntimeBase for resetting the node ID counter
  • Modified the CheckV1IsBlocked test to call this reset method before creating a test server
  • Added explicit node_id: 1 in the test's YAML configuration to match the reset state

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
ydb/library/actors/testlib/test_runtime.h Declares the new static ResetFirstNodeId() method
ydb/library/actors/testlib/test_runtime.cpp Implements ResetFirstNodeId() to reset the static NextNodeId variable to 1
ydb/services/config/bsconfig_ut.cpp Adds reset call at test start and explicit node_id configuration to ensure deterministic behavior

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

void AddLocalService(const TActorId& actorId, TActorSetupCmd cmd, ui32 nodeIndex = 0);
virtual void Initialize();
ui32 GetNodeId(ui32 index = 0) const;
ui32 GetNodeCount() const;
Copy link

Copilot AI Dec 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ResetFirstNodeId method lacks documentation explaining its purpose, when it should be used, and why it's necessary. Given that this modifies global state and affects test isolation, it would be helpful to document that this method should be called at the beginning of tests that require a predictable starting node ID, and explain the implications of resetting this shared state.

Suggested change
ui32 GetNodeCount() const;
ui32 GetNodeCount() const;
/**
* Reset the starting node id used for subsequent test runtimes.
*
* This method modifies shared/global state that affects how node ids are
* assigned to test runtime instances. It is intended to be called at the
* beginning of tests that require a predictable first node id (for example,
* when tests assert on concrete node id values or rely on deterministic
* allocation order).
*
* Because this resets shared state, it can affect other tests running in the
* same process. It should therefore be used in a controlled manner, ideally
* as part of test setup before any test runtimes are constructed, and not
* from concurrently executing tests.
*/

Copilot uses AI. Check for mistakes.
return NodeCount;
}

void TTestActorRuntimeBase::ResetFirstNodeId() {
Copy link

Copilot AI Dec 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ResetFirstNodeId method modifies a static variable without any synchronization mechanism. Since NextNodeId is accessed and modified in other methods (such as in the constructor and InitNodes), this could lead to race conditions if tests are run in parallel or if multiple runtime instances are created concurrently. Consider adding a mutex to protect access to NextNodeId, similar to how other shared state is protected with TGuard guard(Mutex) in methods like AllocateLocalId.

Suggested change
void TTestActorRuntimeBase::ResetFirstNodeId() {
void TTestActorRuntimeBase::ResetFirstNodeId() {
TGuard<TMutex> guard(Mutex);

Copilot uses AI. Check for mistakes.
Comment on lines 327 to +328
ui32 GetNodeCount() const;
static void ResetFirstNodeId();
Copy link

Copilot AI Dec 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method name ResetFirstNodeId is misleading because it actually resets NextNodeId, not FirstNodeId. FirstNodeId is an instance variable that stores the starting node ID for a runtime instance, while NextNodeId is a static variable that tracks the next available node ID across all instances. Consider renaming this method to ResetNextNodeId to accurately reflect what it does.

Suggested change
ui32 GetNodeCount() const;
static void ResetFirstNodeId();
ui32 GetNodeCount() const;
// Deprecated: use ResetNextNodeId instead.
static void ResetFirstNodeId();
static void ResetNextNodeId() { ResetFirstNodeId(); }

Copilot uses AI. Check for mistakes.
@mregrock mregrock force-pushed the fix_flakky_bsconfig branch from 24d465a to 0c1ac74 Compare December 24, 2025 13:49
@github-actions
Copy link

github-actions bot commented Dec 24, 2025

2025-12-24 13:53:43 UTC Pre-commit check linux-x86_64-release-asan for c4732c7 has started.
2025-12-24 13:54:21 UTC Artifacts will be uploaded here
2025-12-24 13:56:30 UTC ya make is running...
🟡 2025-12-24 14:44:16 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet

Ya make output | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
11614 11565 0 42 4 3

🟢 2025-12-24 14:44:24 UTC Build successful.
🟢 2025-12-24 14:44:59 UTC ydbd size 3.8 GiB changed* by -176 Bytes, which is <= 0 Bytes vs main: OK

ydbd size dash main: 1182915 merge: c4732c7 diff diff %
ydbd size 4 087 569 592 Bytes 4 087 569 416 Bytes -176 Bytes -0.000%
ydbd stripped size 1 530 591 704 Bytes 1 530 591 576 Bytes -128 Bytes -0.000%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

@github-actions
Copy link

github-actions bot commented Dec 24, 2025

2025-12-24 13:54:54 UTC Pre-commit check linux-x86_64-relwithdebinfo for c4732c7 has started.
2025-12-24 13:55:11 UTC Artifacts will be uploaded here
2025-12-24 13:57:18 UTC ya make is running...
🟡 2025-12-24 15:19:43 UTC Some tests failed, follow the links below. Going to retry failed tests...

Details

Ya make output | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
30158 28284 0 15 1670 189

2025-12-24 15:19:57 UTC ya make is running... (failed tests rerun, try 2)
🟡 2025-12-24 15:49:20 UTC Some tests failed, follow the links below. Going to retry failed tests...

Details

Ya make output | Test bloat | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
1798 (only retried tests) 1792 0 1 0 5

2025-12-24 15:49:27 UTC ya make is running... (failed tests rerun, try 3)
🔴 2025-12-24 15:50:44 UTC Some tests failed, follow the links below.

Ya make output | Test bloat | Test bloat | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
180 (only retried tests) 174 0 1 0 5

🟢 2025-12-24 15:50:51 UTC Build successful.
🟢 2025-12-24 15:51:12 UTC ydbd size 2.3 GiB changed* by -32 Bytes, which is <= 0 Bytes vs main: OK

ydbd size dash main: 1182915 merge: c4732c7 diff diff %
ydbd size 2 487 005 848 Bytes 2 487 005 816 Bytes -32 Bytes -0.000%
ydbd stripped size 529 416 960 Bytes 529 416 960 Bytes 0 Bytes 0.000%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants