Skip to content

Schema evolution test intermittent failure fix. #5512

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

bekadavis9
Copy link
Contributor

@bekadavis9 bekadavis9 commented May 14, 2025

There has been an intermittent failure in the test C++ API: SchemaEvolution, drop fixed attribute and add back as var-sized. It's been reproduced on M2 and M3 Macs in both Release and Debug modes, and is most prevalent in the nightlies under a series of cmake parameters.

This fix was caught on my M2 Mac in Debug mode, and seems to be a copypasta oversight.

After making the change, I was unable to reproduce the original failure again. I tried under 4 different builds, each of which saw the failure in a nightly job. Each test was run 5000 times.

  • RelWithDebugInfo with asan; experimental-features, serialization
  • Release with asan
    • experimental-features, serialization
    • verbose, ccache, experimental-features, serialization
    • verbose, ccache, serialization

Fixes CORE-49


TYPE: BUG
DESC: Bug fix: C++ SchemaEvolution test intermittent failure.

@bekadavis9 bekadavis9 requested a review from ypatia May 14, 2025 01:05
@@ -267,7 +267,6 @@ void TileCellSlabIter<T>::init_cell_slab_lengths() {
ranges_[dim_num_ - 1][i].end_ - ranges_[dim_num_ - 1][i].start_ + 1;
}
} else {
assert(layout_ == Layout::COL_MAJOR);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am skeptical of this, for two reasons:

First, assert is almost always optimized out in release builds. It uses a C preprocessor directive to expand to nothing. We have observed the failure in release builds, which means either this assert is not at fault, or we are not abiding that pattern.

Second, if the assert did fail, then we would see more catastrophic reporting of a SIGABRT from Catch2 or CI. Either that or we need to find a new testing library.

Does this assert fail deterministically in the test? Or is there a memory corruption which causes it to occasionally fail?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As stated in the PR description, this change was specifically caught in Debug mode; I hit it every time I ran the test. I was unable to reproduce any other failure in Release mode in 20,000 iterations of the test across 4 different builds.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds like three separate problems to me.

  1. The intermittent failure, which we observe in release mode; the assert does not run here and removing it will not fix the intermittent failure.
  2. The correctness of this assert, which we see failing deterministically in debug mode; I would like to see more description here about what the assert claims, why it is failing and whether removing it is correct.
  3. The fact that this fails deterministically in debug mode; we should ensure somehow that our asserts run in CI rather than in development only. I don't think this is an action item for this PR but does require someone to follow up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well look what I stumbled into:

if(TILEDB_ASSERTIONS)
  # On non-Debug builds cmake automatically defines NDEBUG, so we
  # explicitly undefine it: 
 
add_compile_options($<$<AND:$<NOT:$<CONFIG:Debug,RelWithDebInfo>>,$<OR:$<COMPILE_LANGUAGE:C>,$<COMPILE_LANGUAGE:CXX>>>:-UNDEBUG>)

It looks like this is off by default, but a release build with this on would compile in the asserts. Does this get turned on in CI?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TILEDB_ASSERTIONS are default-disabled. AFAICT, no failing builds have --enable-assertions in bootstrap. This was intended to be a separate "fix" of something discovered in Debug, but I think you may be right that it's worth deeper discovery.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I discovered that recently as well, in relation to this discussion
sorry I forgot to circle back here and mention it :/

Copy link
Member

@ypatia ypatia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure this will fix the actual intermittent failure but it will allow us to stop failing on this assertion. I vote for getting this merged today and continue the investigation if needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants