Description
Is your feature request related to a problem? Please describe.
When a SchemaError occurs in pandera, the error message includes every failing element. This becomes problematic when validating large dataframes with many validation failures, as the error output can grow to thousands of lines. This makes the error message difficult to read, can crash terminals or notebooks (the main motivation for raising this issue and implementing a fix), and makes debugging harder because the important information gets buried in excessive output.
Describe the solution you'd like
Add a configurable limit to the number of failure cases shown in SchemaError messages (see the usage sketch after this list). The implementation should:
- Set a default limit of 100 failure cases shown
- Allow configuration via environment variable PANDERA_MAX_FAILURE_CASES
- Support runtime configuration through config_context(max_failure_cases=n)
- Use -1 as a special value to show all failure cases (unlimited)
- Use 0 to show only the count without examples
- When truncated, display a message like "... and X more failure cases (Y total)"
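A minimal sketch of how the proposed setting could be used. Note that `config_context` exists in pandera today, but the `max_failure_cases` keyword and the `PANDERA_MAX_FAILURE_CASES` environment variable are proposals from this issue, not current API; the import path and schema below are illustrative assumptions:

```python
import pandas as pd
import pandera as pa
from pandera.config import config_context  # config_context exists; max_failure_cases is proposed

schema = pa.DataFrameSchema({"amount": pa.Column(float, pa.Check.greater_than(0))})
df = pd.DataFrame({"amount": [-1.5, -2.3, -0.1, -5.2, 1.0]})

# Proposed: process-wide default via environment variable, set before importing pandera:
#   export PANDERA_MAX_FAILURE_CASES=100

# Proposed: scoped override at runtime (keyword does not exist yet)
with config_context(max_failure_cases=2):
    try:
        schema.validate(df)  # error message would list at most 2 failure cases
    except pa.errors.SchemaError as exc:
        print(exc)

# Proposed sentinels: -1 shows all failure cases, 0 shows only the count
```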
Describe alternatives you've considered
- Fixed hard limit: Simply truncate at a fixed number (e.g., always show 10) - rejected because different use cases need different limits
- Log to file: Dump full errors to a file instead of console - rejected as it adds complexity and doesn't solve the immediate visibility problem
- Sampling: Show a random sample of errors - rejected because the first N errors are often more informative than a random sample
- Using None for unlimited: Considered using Optional[int] with None meaning unlimited - rejected in favor of -1 as a sentinel value to keep the type a plain int
Additional context
#2095
This feature is particularly important for:
- Data pipelines processing large datasets (millions of rows)
- Initial data exploration where many validation rules might fail
- Production environments where excessive logging can impact performance
- Jupyter notebooks where huge outputs can make the notebook unresponsive
Example of current behavior with a 1M-row dataframe containing 500K failures:
SchemaError: Column 'amount' failed validator: greater_than(0) failure cases: -1.5, -2.3, -0.1, -5.2, ... [500,000 values printed]
With this feature (default limit=100):
SchemaError: Column 'amount' failed validator: greater_than(0) failure cases: -1.5, -2.3, -0.1, -5.2, ... [96 more values] ... and 499,900 more failure cases (500,000 total)
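For illustration, a standalone sketch of the truncation semantics described above; `format_failure_cases` is a hypothetical helper, not part of pandera's actual error-formatting code:

```python
def format_failure_cases(failure_cases, max_failure_cases=100):
    """Hypothetical helper showing the proposed truncation behavior."""
    total = len(failure_cases)
    if max_failure_cases == 0:
        # 0: show only the count, no examples
        return f"{total:,} failure cases (examples suppressed)"
    if max_failure_cases == -1 or total <= max_failure_cases:
        # -1: unlimited, show everything
        return ", ".join(str(case) for case in failure_cases)
    shown = ", ".join(str(case) for case in failure_cases[:max_failure_cases])
    remaining = total - max_failure_cases
    return f"{shown} ... and {remaining:,} more failure cases ({total:,} total)"


# 500,000 negative values with the default limit of 100
cases = [-(i + 1) / 10 for i in range(500_000)]
print(format_failure_cases(cases))  # ends with "... and 499,900 more failure cases (500,000 total)"
```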