[experimental] Run crosshair in CI #4034

Merged
Zac-HD merged 9 commits into HypothesisWorks:master from crosshair-in-ci on Mar 24, 2025

Conversation

Zac-HD
Member

@Zac-HD Zac-HD commented Jul 7, 2024

See #3914

To reproduce this locally, you can run make check-crosshair-cover/nocover/niche for the same command as in CI, but I'd recommend pytest --hypothesis-profile=crosshair hypothesis-python/tests/{cover,nocover,datetime} -m xf_crosshair --runxfail to select and run only the xfailed tests.

Hypothesis' problems

  • Vast majority of failures are Flaky: Inconsistent results from replaying a failing test... - mostly backend-specific failures; we've both
    • improved reporting in this case to show the crosshair-specific traceback
    • got most of the affected tests passing
  • Invalid internal boolean probability, e.g. "hypothesis/internal/conjecture/data.py", line 2277, in draw_boolean assert p > 2 ** (-64), fixed in 1f845e0 (#4049)
  • many of our test helpers involved nested use of @given, fixed in 3315be6
  • symbolic outside context
  • avoid uninstalling typing_extensions when crosshair depends on it
  • tests which are not really expected to pass on other backends. I'm slowly applying a backend-specific xfail decorator to them, @xfail_on_crosshair(...).
    • tests which expect to raise a healthcheck, and fail because our crosshair profile disables healthchecks. Disable only .too_slow and .filter_too_much, and skip remaining affected tests under crosshair.
    • undo some over-broad skips, e.g. various xfail decorators, pytestmarks, -k 'not decimal' once we're closer
  • provide a special exception type for when running the test or realizing values would hit a PathTimeout; see "Rare PathTimeout errors in provider.realize(...)" (pschanely/hypothesis-crosshair#21) and "Further improve support for symbolic execution" (#3914, comment)
    • and something to signal that we've exhausted Crosshair's ability to explore the test. If this is sound, we've verified the function and can stop (and should record that in the stop_reason)! If unsound, we can continue testing with Hypothesis' default backend, so it's important to distinguish the two cases.
      See "Add BackendCannotProceed to improve integration" (#4092).
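
To make the sound/unsound distinction in the last bullet concrete, here is a minimal sketch. Every name in it (StopKind, BackendStopped, run_with_fallback) is a hypothetical stand-in invented for illustration; it is not the API added in #4092, only one way the control flow could look.

```python
from enum import Enum


class StopKind(Enum):
    """Hypothetical reasons a backend might give for stopping."""
    VERIFIED = "verified"    # sound: every path explored, nothing failed
    EXHAUSTED = "exhausted"  # unsound: the backend simply cannot do more


class BackendStopped(Exception):
    """Hypothetical stand-in exception, not Hypothesis' real class."""

    def __init__(self, kind):
        super().__init__(kind.value)
        self.kind = kind


def run_with_fallback(symbolic_run, default_run):
    """Run the symbolic backend, then decide whether to keep testing."""
    try:
        symbolic_run()
    except BackendStopped as stop:
        if stop.kind is StopKind.VERIFIED:
            # Sound exhaustion: the property holds, so we can stop here.
            return "verified"
        # Unsound exhaustion: keep testing with the default backend.
        default_run()
        return "fell back"
    return "still exploring"
```

The return value plays the role of a stop_reason: a caller can tell a verified property apart from a run that had to fall back.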

Probably Crosshair's problems

Error in operator.eq(Decimal('sNaN'), an_int)

____ test_rewriting_does_not_compare_decimal_snan ____
  File "hypothesis/strategies/_internal/strategies.py", line 1017, in do_filtered_draw
    if self.condition(value):
TypeError: argument must be an integer
while generating 's' from integers(min_value=1, max_value=5).filter(functools.partial(eq, Decimal('sNaN')))
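
For comparison, this is a small stdlib-only sketch of what the same predicate does with a concrete int. It assumes CPython's default decimal context, in which InvalidOperation is trapped; the describe helper is made up for this example.

```python
import functools
import operator
from decimal import Decimal, InvalidOperation

# The same predicate that appears in the failing filter above.
predicate = functools.partial(operator.eq, Decimal("sNaN"))


def describe(value):
    """Report how comparing Decimal('sNaN') with `value` behaves."""
    try:
        return repr(predicate(value))
    except InvalidOperation:
        # Signaling NaNs signal even for ==, and the default context
        # traps InvalidOperation, so we expect to land here for an int.
        return "InvalidOperation"
    except TypeError as err:
        # The branch the symbolic run hit instead.
        return f"TypeError: {err}"
```

With a concrete int, describe(1) should take the InvalidOperation branch, so the TypeError above looks specific to the symbolic int.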

Cases where crosshair doesn't find a failing example but Hypothesis does

Seems fine, there are plenty of cases in the other direction. Tracked with @xfail_on_crosshair(Why.undiscovered) in case we want to dig in later.

Nested use of the Hypothesis engine (e.g. given-inside-given)

This is just explicitly unsupported for now. Hypothesis should probably offer some way for backends to declare that they don't support this, and then raise a helpful error message if you try anyway.

@Zac-HD Zac-HD added the labels tests/build/CI (about testing or deployment *of* Hypothesis) and interop (how to play nicely with other packages) on Jul 7, 2024
@Zac-HD Zac-HD force-pushed the crosshair-in-ci branch 3 times, most recently from 175b347 to 424943f Compare July 7, 2024 20:26
@Zac-HD Zac-HD force-pushed the crosshair-in-ci branch from 424943f to b2d11c7 Compare July 7, 2024 20:56
@pschanely
Contributor

@Zac-HD your triage above is SO great. I am investigating.

@pschanely
Contributor

pschanely commented Jul 8, 2024

Knocked out a few of these in 0.0.60.
I think that means current status on my end is:

  • TypeError: conversion from SymbolicInt to Decimal is not supported
  • Unsupported operand type(s) for -: 'float' and 'SymbolicFloat' in test_float_clamper
  • TypeError: descriptor 'keys' for 'dict' objects doesn't apply to a 'ShellMutableMap' object (or 'values' or 'items').
  • TypeError: _int() got an unexpected keyword argument 'base'
  • Symbolic not realized (in e.g. test_suppressing_filtering_health_check)
  • Error in operator.eq(Decimal('sNaN'), an_int)
  • Zac's cursed example below!

More soon.

@Zac-HD Zac-HD force-pushed the crosshair-in-ci branch from b2d11c7 to 98ccf44 Compare July 11, 2024 07:23
@Zac-HD
Member Author

Zac-HD commented Jul 12, 2024

Ah - the Flaky failures are of course because we had some failure under the Crosshair backend, which did not reproduce under the Hypothesis backend. This is presumably going to point to a range of integration bugs, but is also something that we'll want to clearly explain to users because integration bugs are definitely going to happen in future and users will need to respond (by e.g. using a different backend, ignoring the problem, whatever).

  • improve the reporting around Flaky failures where the differing or missing errors are related to a change of backend while shrinking. See also "Change Flaky to be an ExceptionGroup" (#4040).
  • triage all the current failures so we can fix them

@Zac-HD Zac-HD force-pushed the crosshair-in-ci branch from 98ccf44 to 4bd7e45 Compare July 12, 2024 07:48
@tybug
Member

tybug commented Jul 12, 2024

Most/all of the "expected x, got symbolic" errors are symptoms of an underlying error in my experience (often operation on symbolic while not tracing). In this case running with export HYPOTHESIS_NO_TRACEBACK_TRIM=1 reveals limited_category_index_cache in cm.query is at fault.

@Zac-HD
Member Author

Zac-HD commented Jul 12, 2024

Ah-ha, seems like we might want some #4029-style 'don't cache on backends with avoid_realize=True' logic.
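
Something along these lines, as a sketch only. The provider classes and cached_unless_symbolic are hypothetical stand-ins; the only name taken from this thread is the avoid_realize flag, and the real cache in cm.query will look different.

```python
class Provider:
    """Hypothetical stand-in for a backend provider."""
    avoid_realize = False


class SymbolicProvider(Provider):
    avoid_realize = True


def cached_unless_symbolic(compute):
    """Memoize compute(key), but bypass the cache for symbolic backends."""
    cache = {}

    def lookup(provider, key):
        if provider.avoid_realize:
            # Hashing or comparing a symbolic key would force it to a
            # concrete value, so skip the cache and compute directly.
            return compute(key)
        if key not in cache:
            cache[key] = compute(key)
        return cache[key]

    lookup.cache = cache  # exposed so callers can inspect or clear it
    return lookup
```

The cost is recomputing on every call under the symbolic backend, which seems a fair trade for not touching the symbolic key.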

@Zac-HD Zac-HD force-pushed the crosshair-in-ci branch 2 times, most recently from 1d2345d to 7bf8983 Compare July 12, 2024 20:15
@pschanely
Contributor

Still here and excited about this! I am on a detour of doing a real symbolic implementation of the decimal module - should get that out this weekend.

@Zac-HD Zac-HD force-pushed the crosshair-in-ci branch 2 times, most recently from cc07927 to 018ccab Compare July 13, 2024 07:23
@Zac-HD
Member Author

Zac-HD commented Jul 13, 2024

Triaging a pile of the Flaky errors: most were due to getting a RecursionError under crosshair and then passing under Hypothesis, and it looks like most of those were in turn because of all our nested-@given() test helpers.

So I've tried de-nesting those, which seems to work nicely and even makes things a bit faster by default; and when CI finishes we'll see how much it helps on crosshair 🤞

@tybug
Member

tybug commented Mar 18, 2025

Looks like that worked, and we now have a very slow test in the a-d range (5+ hours before I gave up and pushed, which canceled it). We should rerun with verbosity to narrow it down; I'll cancel it for now to avoid wasting minutes.

@pschanely
Contributor

> Looks like that worked, and we now have a very slow test in the a-d range

I think the biggest cause of the slow ones is that the hypothesis time patching confounds CrossHair's timeout mechanisms. In my runs, I did this to work around that, but not sure what's actually appropriate. It might be better to just skip the slow ones.

@Zac-HD
Member Author

Zac-HD commented Mar 18, 2025

We've skipped them so far, but maybe we should instead disable our monkeypatching-in-selftests for the Crosshair tests?

@tybug
Member

tybug commented Mar 20, 2025

@Zac-HD what's the _hack_xfail_crosshair_error fixture for? is it related to the time monkeypatching? I didn't actually realize we were skipping monkeypatched-time tests currently, I thought we had fixed that by importing crosshair before patching it.
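
For anyone following along, here is a stdlib-only sketch of why import order can matter for monkeypatched time. It assumes the patch replaces a function on the time module and that the other library bound the clock when it was imported; whether CrossHair's timeouts actually work that way is not something this sketch establishes. All names are illustrative.

```python
import time
from unittest import mock

# Bound before any patching, like `from time import monotonic` in a
# module that was imported early.
_real_monotonic = time.monotonic


def early_bound_now():
    return _real_monotonic()


def late_bound_now():
    # Looks the function up on the module at call time, so it sees patches.
    return time.monotonic()


def frozen_clock():
    return 0.0


with mock.patch.object(time, "monotonic", frozen_clock):
    late = late_bound_now()    # reads the frozen clock
    early = early_bound_now()  # still reads the real clock
```

Only the late-bound lookup is affected by the patch, which is the effect importing before patching relies on.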

@tybug
Member

tybug commented Mar 20, 2025

Latest run looks great. I found the test that was hanging in the a-d range by running locally, and skipped that one. Remaining failures are about 20% "probably a real problem" and 80% tests that need to be skipped or adjusted.

Here's a failure that is probably our fault for not realizing somewhere (CI run):

  File "/home/runner/work/hypothesis/hypothesis/hypothesis-python/.tox/crosshair-custom/lib/python3.10/site-packages/hypothesis/database.py", line 1068, in choices_to_bytes
    assert isinstance(elem, str)
AssertionError: assert False
 +  where False = isinstance(<[CrossHairInternal('Numeric operation on symbolic while not tracing') raised in repr()] SymbolicInt object at 0x7f7f20f61420>, str)

@Zac-HD
Member Author

Zac-HD commented Mar 20, 2025

> @Zac-HD what's the _hack_xfail_crosshair_error fixture for?

Oh, that was a very very temporary hack. I think we've fixed the underlying issue now and should delete it.

@pschanely
Contributor

> Latest run looks great. I found the test that was hanging in the a-d range by running locally, and skipped that one. Remaining failures are about 20% "probably a real problem" and 80% tests that need to be skipped or adjusted.
>
> Here's a failure that is probably our fault for not realizing somewhere (CI run):
>
>   File "/home/runner/work/hypothesis/hypothesis/hypothesis-python/.tox/crosshair-custom/lib/python3.10/site-packages/hypothesis/database.py", line 1068, in choices_to_bytes
>     assert isinstance(elem, str)
> AssertionError: assert False
>  +  where False = isinstance(<[CrossHairInternal('Numeric operation on symbolic while not tracing') raised in repr()] SymbolicInt object at 0x7f7f20f61420>, str)

Yup - I think we're accessing data.choices here, just prior to the finally block that will do the realization we need.

@tybug
Member

tybug commented Mar 22, 2025

oh, beautiful. That case is for fatal engine errors and takes a different save path than normal. We should realize there.

There's a second point, which is that this fatal path was being taken for _pytest.outcomes.Skipped! I don't think this is correct. I think we should use except skip_exceptions_to_reraise(): and just reraise those for control flow, rather than treating a skip as a failure that needs to be saved to the db.
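
Roughly this shape, as a sketch. The Skipped class and the skip_exceptions_to_reraise function below are local stand-ins defined for the example (not the real pytest or Hypothesis objects), and `saved` stands in for the database.

```python
class Skipped(BaseException):
    """Local stand-in for a test runner's skip exception."""


def skip_exceptions_to_reraise():
    """Local stand-in: the exception types that mean 'skip', not 'fail'."""
    return (Skipped,)


saved = []  # stand-in for the example database


def run_test_case(test, choices):
    try:
        test()
    except skip_exceptions_to_reraise():
        # Control flow, not a failure: propagate without saving anything.
        raise
    except BaseException:
        # Fatal-error path: record the choices, then propagate.
        saved.append(list(choices))
        raise
```

Putting the skip handler first matters, because the broad handler below it would otherwise catch skips too.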

@tybug
Member

tybug commented Mar 22, 2025

I think the remaining failures are split roughly evenly between Hypothesis' fault, and tests that need to be skipped. They all deserve more thorough investigation to determine which one, and what the cause is. I haven't looked too deeply at them yet. We're getting really close to a clean run!

@tybug
Member

tybug commented Mar 22, 2025

Posting some early diagnostics here: I think we're interacting with the symbolics when observability mode is enabled (i.e. we have a testcase callback) in such a way that causes the symbolic to pick up a path constraint (or maybe be fully realized?), because the following is very fast:

from hypothesis import given, settings
from hypothesis import strategies as st

@given(st.integers(), st.floats(), st.data())
@settings(backend="crosshair")
def f(v1, v2, data):
    # The arguments are never inspected, so no path constraints should arise.
    print("call")
    data.draw(st.booleans())

f()

but when you add a testcase callback, it gets much slower, and crosshair abandons some test cases with BackendCannotProceed, which shouldn't be happening for a test that doesn't interact with its args at all:

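import hypothesis.internal.observability
from hypothesis import given, settings
from hypothesis import strategies as st

def callback(observation):
    # No-op: having any testcase callback registered enables observability mode.
    pass

hypothesis.internal.observability.TESTCASE_CALLBACKS.append(callback)

@given(st.integers(), st.floats(), st.data())
@settings(backend="crosshair")
def f(v1, v2, data):
    # Same test as before; only the registered callback differs.
    print("call")
    data.draw(st.booleans())

f()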

Edit: yup, repr_call and to_jsonable are adding constraints or straight-up realizing. I'm looking into a fix.

@tybug
Member

tybug commented Mar 23, 2025

Latest push is an improvement, but I'm pretty confident we're still adding path constraints somewhere, because I still see a speed difference

@Zac-HD Zac-HD marked this pull request as ready for review March 24, 2025 02:50
@Zac-HD Zac-HD requested a review from DRMacIver as a code owner March 24, 2025 02:50
@Zac-HD Zac-HD merged commit e66f7fb into HypothesisWorks:master Mar 24, 2025
59 of 60 checks passed
@Zac-HD Zac-HD deleted the crosshair-in-ci branch March 24, 2025 03:03
@Zac-HD Zac-HD mentioned this pull request Mar 24, 2025