
[experimental] Run crosshair in CI #4034

Draft · Zac-HD wants to merge 12 commits into master from crosshair-in-ci
Conversation

@Zac-HD (Member) commented Jul 7, 2024

See #3914

To reproduce this locally, you can run make check-crosshair-cover/nocover/niche for the same command as in CI, but I'd recommend pytest --hypothesis-profile=crosshair hypothesis-python/tests/{cover,nocover,datetime} -m xf_crosshair --runxfail to select and run only the xfailed tests.

Hypothesis' problems

  • The vast majority of failures are Flaky: Inconsistent results from replaying a failing test... - mostly backend-specific failures; we've both
    • improved reporting in this case to show the crosshair-specific traceback
    • got most of the affected tests passing
  • Invalid internal boolean probability, e.g. "hypothesis/internal/conjecture/data.py", line 2277, in draw_boolean: assert p > 2 ** (-64); fixed in 1f845e0 (#4049)
  • many of our test helpers involved nested use of @given, fixed in 3315be6
  • symbolic outside context
  • avoid uninstalling typing_extensions when crosshair depends on it
  • tests which are not really expected to pass on other backends. I'm slowly applying a backend-specific xfail decorator to them, @xfail_on_crosshair(...).
    • tests which expect to raise a healthcheck, and fail because our crosshair profile disables healthchecks. Disable only .too_slow and .filter_too_much, and skip remaining affected tests under crosshair.
    • undo some over-broad skips, e.g. various xfail decorators, pytestmarks, -k 'not decimal' once we're closer
  • provide a special exception type for when running the test or realizing values would hit a PathTimeout; see Rare PathTimeout errors in provider.realize(...) pschanely/hypothesis-crosshair#21 and Stable support for symbolic execution #3914 (comment)
    • and something to signal that we've exhausted Crosshair's ability to explore the test. If this is sound, we've verified the function and can stop! (and should record that in the stop_reason). If unsound, we can continue testing with Hypothesis' default backend - so it's important to distinguish.
      Add BackendCannotProceed to improve integration #4092
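The backend-specific xfail decorator mentioned above could be sketched roughly like this (assumptions: the real @xfail_on_crosshair helper in Hypothesis' test suite takes a Why enum and detects the active backend differently; the HYPOTHESIS_PROFILE environment check here is purely illustrative):

```python
import os

import pytest


def xfail_on_crosshair(reason, *, strict=False):
    """Mark a test as expected to fail only when running on the crosshair backend.

    Illustrative sketch: detecting the backend via an environment variable is
    an assumption, not how Hypothesis' own test suite does it.
    """
    running_crosshair = os.environ.get("HYPOTHESIS_PROFILE") == "crosshair"
    return pytest.mark.xfail(condition=running_crosshair, reason=reason, strict=strict)
```

Usage would then look like `@xfail_on_crosshair("nested use of @given is unsupported")` above a test function, so the test still runs (and must pass) on the default backend.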

Probably Crosshair's problems

Error in operator.eq(Decimal('sNaN'), an_int)

____ test_rewriting_does_not_compare_decimal_snan ____
  File "hypothesis/strategies/_internal/strategies.py", line 1017, in do_filtered_draw
    if self.condition(value):
TypeError: argument must be an integer
while generating 's' from integers(min_value=1, max_value=5).filter(functools.partial(eq, Decimal('sNaN')))

Cases where crosshair doesn't find a failing example but Hypothesis does

Seems fine, there are plenty of cases in the other direction. Tracked with @xfail_on_crosshair(Why.undiscovered) in case we want to dig in later.

Nested use of the Hypothesis engine (e.g. given-inside-given)

This is just explicitly unsupported for now. Hypothesis should probably offer some way for backends to declare that they don't support this, and then raise a helpful error message if you try anyway.
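One possible shape for that capability flag, as a stdlib-only sketch (the attribute name, class names, and error text are hypothetical, not Hypothesis' actual provider API):

```python
# Sketch: a backend declares whether it can handle nested engine use, and
# the engine fails fast with a helpful message instead of crashing obscurely.
class PrimitiveProvider:
    supports_nested_given = True  # default: nested @given works


class CrosshairProvider(PrimitiveProvider):
    supports_nested_given = False  # symbolic tracing can't be re-entered


def enter_engine(provider, *, already_inside):
    # Called when starting a test-function execution; `already_inside` is True
    # if another engine invocation is active on the stack.
    if already_inside and not provider.supports_nested_given:
        raise RuntimeError(
            f"backend {type(provider).__name__} does not support nested @given; "
            "restructure the helper (e.g. draw with st.data() in the outer test) "
            "or switch backends for this test"
        )
```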

@Zac-HD added the labels tests/build/CI (about testing or deployment *of* Hypothesis) and interop (how to play nicely with other packages) on Jul 7, 2024
@Zac-HD Zac-HD force-pushed the crosshair-in-ci branch 3 times, most recently from 175b347 to 424943f Compare July 7, 2024 20:26
@Zac-HD Zac-HD force-pushed the crosshair-in-ci branch from 424943f to b2d11c7 Compare July 7, 2024 20:56
@pschanely (Contributor)

@Zac-HD your triage above is SO great. I am investigating.

@pschanely (Contributor) commented Jul 8, 2024

Knocked out a few of these in 0.0.60.
I think that means current status on my end is:

  • TypeError: conversion from SymbolicInt to Decimal is not supported
  • Unsupported operand type(s) for -: 'float' and 'SymbolicFloat' in test_float_clamper
  • TypeError: descriptor 'keys' for 'dict' objects doesn't apply to a 'ShellMutableMap' object (or 'values' or 'items').
  • TypeError: _int() got an unexpected keyword argument 'base'
  • Symbolic not realized (in e.g. test_suppressing_filtering_health_check)
  • Error in operator.eq(Decimal('sNaN'), an_int)
  • Zac's cursed example below!

More soon.

@Zac-HD Zac-HD force-pushed the crosshair-in-ci branch from b2d11c7 to 98ccf44 Compare July 11, 2024 07:23
@Zac-HD (Member, Author) commented Jul 12, 2024

Ah - the Flaky failures are of course because we had some failure under the Crosshair backend, which did not reproduce under the Hypothesis backend. This is presumably going to point to a range of integration bugs, but is also something that we'll want to clearly explain to users because integration bugs are definitely going to happen in future and users will need to respond (by e.g. using a different backend, ignoring the problem, whatever).

  • improve the reporting around Flaky failures where the differing or missing errors are related to a change of backend while shrinking. See also Change Flaky to be an ExceptionGroup #4040.
  • triage all the current failures so we can fix them

@Zac-HD Zac-HD force-pushed the crosshair-in-ci branch from 98ccf44 to 4bd7e45 Compare July 12, 2024 07:48
@tybug (Member) commented Jul 12, 2024

Most/all of the "expected x, got symbolic" errors are, in my experience, symptoms of an underlying error (often an operation on a symbolic value while not tracing). In this case, running with HYPOTHESIS_NO_TRACEBACK_TRIM=1 reveals that limited_category_index_cache in cm.query is at fault.

@Zac-HD (Member, Author) commented Jul 12, 2024

Ah-ha! Seems like we want some #4029-style "don't cache on backends with avoid_realize=True" logic.
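Such logic might look like this sketch (function and argument names are hypothetical; the real caches live deeper in Hypothesis internals):

```python
# Sketch: skip caching entirely when the active backend hands back symbolic
# values (avoid_realize=True), since a symbolic cached in one statespace
# context is invalid, or outright crashes, when touched in the next.
def query_with_cache(cache, key, compute, *, avoid_realize):
    if avoid_realize:
        # Never store or reuse possibly-symbolic values.
        return compute()
    if key not in cache:
        cache[key] = compute()
    return cache[key]
```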

@Zac-HD Zac-HD force-pushed the crosshair-in-ci branch 2 times, most recently from 1d2345d to 7bf8983 Compare July 12, 2024 20:15
@pschanely (Contributor)

Still here and excited about this! I am on a detour of doing a real symbolic implementation of the decimal module - should get that out this weekend.

@Zac-HD Zac-HD force-pushed the crosshair-in-ci branch 2 times, most recently from cc07927 to 018ccab Compare July 13, 2024 07:23
@Zac-HD (Member, Author) commented Jul 13, 2024

Triaging a pile of the Flaky errors: most were due to hitting a RecursionError under crosshair and then passing under Hypothesis, and it looks like most of those were in turn caused by all our nested-@given() test helpers.

So I've tried de-nesting those, which seems to work nicely and even makes things a bit faster by default; and when CI finishes we'll see how much it helps on crosshair 🤞
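The de-nesting looks roughly like this generic before/after (illustrative strategies and assertion, not the actual helpers changed in this PR; requires hypothesis to be installed):

```python
from hypothesis import given, strategies as st

# Before: a helper that spins up its own engine when called from a @given
# test, nesting engines (and, under crosshair, nesting tracing contexts):
#
#   def assert_commutes(x):
#       @given(st.integers())
#       def inner(y):
#           assert x + y == y + x
#       inner()

# After: draw the extra value inside the single outer engine via st.data().
@given(st.integers(), st.data())
def test_commutes(x, data):
    y = data.draw(st.integers())
    assert x + y == y + x
```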

@Zac-HD Zac-HD force-pushed the crosshair-in-ci branch from 30fe0b8 to d51ad9f Compare March 2, 2025 09:17
@Zac-HD (Member, Author) commented Mar 2, 2025

OK, we have very many crosshair.util.CrossHairInternal: Not in a statespace context errors, which I think must have some systematic underlying cause. Not looking further at this time of day, but I'd like to get this merged in the next week or two, so we can show it off at BugBash as a stable feature 😁 (and maybe also integrate into HypoFuzz!)

@Zac-HD Zac-HD force-pushed the crosshair-in-ci branch 2 times, most recently from 102058b to 262dba3 Compare March 2, 2025 23:53
@Zac-HD Zac-HD force-pushed the crosshair-in-ci branch from 262dba3 to 95e734c Compare March 3, 2025 00:05
@tybug tybug force-pushed the crosshair-in-ci branch 2 times, most recently from 0e4d99d to 95e734c Compare March 3, 2025 07:39
@tybug (Member) commented Mar 3, 2025

(sorry, my local was out of date and force pushes above were me rebasing and subsequently realizing that had already been done in the up-to-date branch. nothing should have changed)

@Zac-HD (Member, Author) commented Mar 3, 2025

HYPOTHESIS_NO_TRACEBACK_TRIM=1 pytest --hypothesis-profile=crosshair hypothesis-python/tests/cover/test_deadline.py -x points here, which I consider pretty suspicious from a "we seem to have a lifetime error" perspective. I'm not actively chasing this, but let's stick to non-force pushes?

@tybug (Member) commented Mar 3, 2025

Yup, Phillip beat us to it 😅 pschanely@fd6958f

Underlying reason is crosshair raising discard_test_case for the initial zero_data call. We then pin it and check for health checks - but it was discarded, and therefore not realized. Maybe we should just not do any of those things if a backend raises discard_test_case? I'm hesitant to disable this for all alternative backends, but we don't currently have a way to check the BackendCannotProceed status of a data, though we could add that status to the data before freezing it.

This is also in part because we decided to give alternative backends control over NodeTemplate(type="simplest") draws in #4247.

@tybug (Member) commented Mar 17, 2025

I've addressed the zero_pin issue discussed above by tracking the cannot_proceed reason on ConjectureData, and not pinning if the backend could not proceed, assuming that if the test case cannot proceed then the values cannot be realized either.

@pschanely after doing this the CI run shows crosshair returning values of the wrong type from CrosshairPrimitiveProvider.realize on the second test case, after the first test case raises BackendCannotProceed (search hypothesis.errors.HypothesisException: expected <class 'int'>). I wonder if the provider is not removing the choices made during cannot-proceed iterations from the list of values to return while realizing?

Unfortunately this is flaky locally, possibly due to speed-related timeouts. HYPOTHESIS_NO_TRACEBACK_TRIM=1 pytest --hypothesis-profile=crosshair -k test_adds_note_showing_which_strategy_stateful should reproduce in theory. Run it from hypothesis-python/, not the repo root.
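The zero_pin fix described above can be sketched as follows (the field and function names are assumptions about Hypothesis internals, not its real API):

```python
# Sketch: record why the backend bailed on a data object, and skip pinning
# (and the health checks that follow) when the zero data was discarded.
class ConjectureData:
    def __init__(self):
        self.cannot_proceed_scope = None  # e.g. "discard_test_case", set on bailout


def maybe_pin_zero_data(data, pin):
    if data.cannot_proceed_scope is not None:
        # The test case was discarded before its values were realized,
        # so there is nothing sound to pin or health-check.
        return False
    pin(data)
    return True
```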

@pschanely (Contributor)

> I've addressed the zero_pin issue discussed above by tracking the cannot_proceed reason on ConjectureData, and not pinning if the backend could not proceed, assuming that if the test case cannot proceed then the values cannot be realized either.

Thank you!

> @pschanely after doing this the CI run shows crosshair returning values of the wrong type from CrosshairPrimitiveProvider.realize on the second test case, after the first test case raises BackendCannotProceed (search hypothesis.errors.HypothesisException: expected <class 'int'>). I wonder if the provider is not removing the choices made during cannot-proceed iterations from the list of values to return while realizing?

SGTM; I will investigate tomorrow! One change that we made a while back is possibly relevant here: to avoid false-positives from approximations like real-based floats, we don't actually report the error right away; instead we'll abort the failing pass and try to recreate the pass next time with concrete draws. This can go haywire though if the requested draw types aren't the same on the next pass.

@pschanely (Contributor)

> SGTM; I will investigate tomorrow! One change that we made a while back is possibly relevant here: to avoid false-positives from approximations like real-based floats, we don't actually report the error right away; instead we'll abort the failing pass and try to recreate the pass next time with concrete draws. This can go haywire though if the requested draw types aren't the same on the next pass.

Ok, @tybug so I've confirmed that our first and second passes have inconsistent draws, but only got through one layer of diagnostics. Maybe it's enough to spark an idea? Fortunately, at least for me, the issue happens consistently. 🤷

The first draw of the first pass is for a boolean, via this stack: _execute_once_for_engine (core.py:1091) → execute_once (core.py:1028) → default_executor (core.py:729) → run (core.py:1002) → run_state_machine (stateful.py:115) → test (core.py:894) → run_state_machine (stateful.py:160) → draw_boolean (data.py:929) → _draw (data.py:749) → _pop_choice (data.py:972) → draw_boolean (crosshair_provider.py:231)

But the second pass skips over this draw. In order for ConjectureData._draw to query the backend, I think(?) we have to get one of these conditions to pass.
In the first pass, observe is True and prefix is singleton tuple with a ChoiceTemplate inside it.
In the second pass though, prefix is an empty list. forced=False (not None) in both passes, so crosshair doesn't get queried for the boolean at all.

Inconsistent draws aside, I should change the provider to ensure it yields SOME value of the requested type. After all, I'm running the concrete execution under the premise that the behavior might actually differ, so I shouldn't be quite so surprised when it does. 😆
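That hardening could be sketched as follows (purely illustrative; crosshair's actual realize logic differs, and the fallback defaults here are assumptions):

```python
# Sketch: when realizing, never hand Hypothesis a value of the wrong type.
# Coerce if possible (e.g. a symbolic int realized late), else substitute a
# default of the requested type.
_DEFAULTS = {int: 0, float: 0.0, bool: False, str: "", bytes: b""}


def realize_as(value, expected_type):
    if isinstance(value, expected_type):
        return value
    try:
        return expected_type(value)  # e.g. SymbolicInt -> int
    except Exception:
        # Concrete replay diverged; fall back to a safe default rather than
        # returning a mismatched value.
        return _DEFAULTS[expected_type]
```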

@tybug (Member) commented Mar 18, 2025

ah, this is great, thanks. I think Hypothesis requests a value from the backend for forced + ChoiceTemplate, but not for forced, so crosshair is rightfully unhappy about nondeterminism. We should unify this to never request from the backend if forced is present.
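The unified rule might be sketched as below (illustrative names; Hypothesis' real _draw logic has more cases, and "ChoiceTemplate" stands in here for the prefix-entry type that defers to the backend per #4247):

```python
# Sketch: decide whether a draw should consult the backend provider.
# Previously, a forced draw still hit the backend when the prefix entry was
# a ChoiceTemplate, but not otherwise, giving the backend an inconsistent
# view across passes. Unified rule: forced means never query.
def should_query_backend(*, forced, prefix_entry):
    if forced is not None:
        return False  # forced draws are fully determined; keep the backend in sync
    # Unforced: fresh draws and template entries go to the backend; concrete
    # prefix entries replay without it.
    return prefix_entry is None or prefix_entry == "ChoiceTemplate"
```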

@tybug (Member) commented Mar 18, 2025

Looks like that worked, and we now have a very slow test in the a-d range (5+ hours before I gave up and pushed, which canceled it). We should rerun with verbosity to narrow it down; I'll cancel it for now to avoid wasting minutes.

@pschanely (Contributor)

> Looks like that worked, and we now have a very slow test in the a-d range

I think the biggest cause of the slow ones is that the hypothesis time patching confounds CrossHair's timeout mechanisms. In my runs, I did this to work around that, but not sure what's actually appropriate. It might be better to just skip the slow ones.

@Zac-HD (Member, Author) commented Mar 18, 2025

We've skipped them so far, but maybe we should instead disable our monkeypatching-in-selftests for the Crosshair tests?
