Test Bug: Rare failures in TestTransactions_Mem #1298

@paulirwin

Is there an existing issue for this?

  • I have searched the existing issues

Describe the bug

Split out from #1295. This is confirmed to be a platform- and target-agnostic race bug, not anything specific to net10.0, although it is possible that it surfaces more often on net10.0 for framework reasons that have not been determined.

Original test failure report:

Expected: True  Actual: False
(Test: Lucene.Net.Index.TestTransactions.TestTransactions_Mem)


To reproduce this test result:


Option 1:


Apply the following assembly-level attributes:


[assembly: Lucene.Net.Util.RandomSeed("0x05988a4671cb8d53:0x175a1893dc7e9151")]
[assembly: NUnit.Framework.SetCulture("ff-Latn-SN")]


Option 2:


Use the following .runsettings file:


<RunSettings>
  <TestRunParameters>
    <Parameter name="tests:seed" value="0x05988a4671cb8d53:0x175a1893dc7e9151" />
    <Parameter name="tests:culture" value="ff-Latn-SN" />
  </TestRunParameters>
</RunSettings>

Option 3:


Create the following lucene.testsettings.json file somewhere between the test assembly and the root of your drive:


{
  "tests": {
    "seed": "0x05988a4671cb8d53:0x175a1893dc7e9151",
    "culture": "ff-Latn-SN"
  }
}


Fixture Test Values

Random Seed:           0x05988a4671cb8d53:0x175a1893dc7e9151
Culture:               ff-Latn-SN
Time Zone:             (UTC-05:00) Eastern Time (Port-au-Prince)
Default Codec:         Lucene46 (RandomCodec)
Default Similarity:    DefaultSimilarity


System Properties

Nightly:               False
Weekly:                False
Slow:                  True
Awaits Fix:            False
Directory:             random
Verbose:               False
Random Multiplier:     1

In TestTransactions.IndexerThread.DoWork, any exception thrown by PrepareCommit is caught; the writers are then rolled back and the method returns. The test does not care about the details of the exception, only that the transactional protocol works correctly in the presence of random I/O failures.

However, it does not catch exceptions thrown by Commit. Forcing Commit to throw reproduces this test failure. Wrapping the Commit call in a try/catch, as is already done for PrepareCommit, makes the test pass under that forced failure.
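A minimal sketch of the shape of that change, not the actual test code. `IFakeWriter`, `ThrowingWriter`, and `TransactionSketch.TryTransaction` are hypothetical stand-ins for the two writers and the body of DoWork; only the method names PrepareCommit, Commit, and Rollback come from the test.

```csharp
using System;
using System.IO;

// Hypothetical stand-in for the writer operations the test uses.
public interface IFakeWriter
{
    void PrepareCommit();
    void Commit();
    void Rollback();
}

// Hypothetical writer that can be told to fail in Commit, to force the rare path.
public sealed class ThrowingWriter : IFakeWriter
{
    private readonly bool failOnCommit;

    public bool RolledBack { get; private set; }

    public ThrowingWriter(bool failOnCommit)
    {
        this.failOnCommit = failOnCommit;
    }

    public void PrepareCommit() { }

    public void Commit()
    {
        if (failOnCommit) throw new IOException("forced failure in Commit");
    }

    public void Rollback()
    {
        RolledBack = true;
    }
}

public static class TransactionSketch
{
    // Returns true if both writers committed, false if the transaction was rolled back.
    public static bool TryTransaction(IFakeWriter writer1, IFakeWriter writer2)
    {
        try
        {
            writer1.PrepareCommit();
            writer2.PrepareCommit();
        }
        catch (Exception)
        {
            // Existing behavior: a failed PrepareCommit rolls back both writers.
            RollBackBoth(writer1, writer2);
            return false;
        }

        try
        {
            writer1.Commit();
            writer2.Commit();
        }
        catch (Exception)
        {
            // Proposed behavior: treat a failed Commit the same way.
            RollBackBoth(writer1, writer2);
            return false;
        }

        return true;
    }

    private static void RollBackBoth(IFakeWriter writer1, IFakeWriter writer2)
    {
        writer1.Rollback();
        writer2.Rollback();
    }
}
```

With `new ThrowingWriter(failOnCommit: true)` as the second writer, `TryTransaction` returns false and both writers report `RolledBack`, instead of the exception escaping the thread.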

This appears to simply be a bug (or, to put it more mildly, a limitation) in the test code, and the same limitation exists in the Java code. The Java project has likely run into this failure occasionally too.

Is this .NET 10 related? It does not appear to be. By forcing failure in Commit, the test failure can be reliably reproduced on .NET 8 through 10 (.NET Framework has not been tried yet). Without forcing it, several hours of repeated, focused runs of this test did not show any failures, so it is not easily reproducible as-is. It is always possible that performance differences in new framework versions cause races to appear more or less frequently, nondeterministically.

Why is it rare? For this scenario to happen, the following things have to be true:

  1. The first PrepareCommit call has to succeed. Given the many calls it makes that can randomly fail, the probability of this is very low. A rough estimate from tracing the logic is that it happens about 0.01% of the time, based on purposefully-thrown exceptions alone.
  2. The second PrepareCommit call has to succeed. Square the probability of item 1.
  3. One of the two Commit calls has to throw. This is also not guaranteed, but more likely than not if you get to this point.
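
As a rough worked figure: taking the 0.01% estimate from item 1 at face value, and assuming the two PrepareCommit calls succeed independently with the same probability $p$ (both are assumptions, not measurements), the chance of getting as far as Commit is approximately:

$$P(\text{reach Commit}) \approx p^2 = \left(10^{-4}\right)^2 = 10^{-8}$$

That is on the order of one transaction attempt in a hundred million, before item 3 is even considered.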

In 500 repeated runs of the test on .NET 10 (macOS, arm64), with instrumentation added to count how often each call threw, the results are striking:

  • PrepareCommit call 1 threw 1531 times (100% of the time)
  • PrepareCommit call 2 threw 0 times (did not get there)
  • Commit threw 0 times (did not get there)
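
For anyone wanting to gather similar numbers, counting could be done along these lines. This is a hypothetical sketch, not the instrumentation that produced the figures above; `Step` and `ThrowCounters` are made-up names, and the counters use `Interlocked` because the indexer threads run concurrently.

```csharp
using System;
using System.Threading;

// Hypothetical labels for the calls being counted.
public enum Step
{
    PrepareCommit1 = 0,
    PrepareCommit2 = 1,
    Commit = 2
}

// Hypothetical thread-safe counters for how often each step threw.
public static class ThrowCounters
{
    private static readonly int[] counts = new int[3];

    // Runs the action; returns true on success, or counts the throw and returns false.
    public static bool Run(Step step, Action action)
    {
        try
        {
            action();
            return true;
        }
        catch (Exception)
        {
            Interlocked.Increment(ref counts[(int)step]);
            return false;
        }
    }

    // Reads the current count for a step.
    public static int Get(Step step)
    {
        return Volatile.Read(ref counts[(int)step]);
    }
}
```

A call site would look like `if (!ThrowCounters.Run(Step.PrepareCommit1, () => writer1.PrepareCommit())) { /* roll back and return */ }`, with the totals printed once at the end of the run.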

Solution: We should catch and swallow exceptions in Commit for this test and roll back, exactly as we do when a call to PrepareCommit fails. Exceptions in Commit are not a failure of the functionality under test; the test exists precisely to check behavior when exceptions do happen, not to assert that they don't.

Aside: This is arguably a poorly designed test if PrepareCommit throws roughly 100% of the time on the first call. If that is the case, it might as well be set to always throw and not even attempt a second PrepareCommit or a Commit. A better solution might be to configure this test to throw random exceptions less often. That would let it exercise the transactional behavior more thoroughly, under both failure and success, and could give different doc counts to assert on when it occasionally gets through to a successful Commit. Currently, in the very rare scenario where all 4 calls (both PrepareCommit calls and both Commit calls) succeed, nothing tells us that it happened. Regardless, we would still need the catch around Commit, since Commit is expected to fail when it is reached.
