Limit log length for SchemaError #2094

@stakahashy

Description

Is your feature request related to a problem? Please describe.
When a SchemaError occurs in pandera, it outputs the full log of all schema-errored elements. When validating large dataframes with many validation failures, this error log can run to thousands of lines, which makes the error message hard to read, can crash terminals or notebooks (the main motivation for raising this issue and implementing a fix), and makes debugging harder because the important information gets buried in excessive output.

Describe the solution you'd like
Add a configurable limit to the number of failure cases shown in SchemaError messages. The implementation should (see the sketch after this list):

  • Show at most 100 failure cases by default
  • Allow configuration via environment variable PANDERA_MAX_FAILURE_CASES
  • Support runtime configuration through config_context(max_failure_cases=n)
  • Use -1 as a special value to show all failure cases (unlimited)
  • Use 0 to show only the count without examples
  • When truncated, display a message like "... and X more failure cases (Y total)"
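
A minimal sketch of the proposed mechanics, assuming Python; config_context and PANDERA_MAX_FAILURE_CASES mirror the names above, but this code is illustrative only, not pandera's actual implementation:

  import os
  from contextlib import contextmanager
  from dataclasses import dataclass

  _DEFAULT_MAX_FAILURE_CASES = 100

  @dataclass
  class _Config:
      # -1 = unlimited, 0 = count only, N > 0 = show the first N failure cases
      max_failure_cases: int = int(
          os.environ.get("PANDERA_MAX_FAILURE_CASES", _DEFAULT_MAX_FAILURE_CASES)
      )

  CONFIG = _Config()

  @contextmanager
  def config_context(max_failure_cases: int):
      """Temporarily override the limit, restoring the previous value on exit."""
      previous = CONFIG.max_failure_cases
      CONFIG.max_failure_cases = max_failure_cases
      try:
          yield
      finally:
          CONFIG.max_failure_cases = previous

  # e.g. show everything for a single validation call:
  #   with config_context(max_failure_cases=-1):
  #       schema.validate(df)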

Describe alternatives you've considered

  1. Fixed hard limit: Simply truncate at a fixed number (e.g., always show 10) - rejected because different use cases need different limits
  2. Log to file: Dump full errors to a file instead of console - rejected as it adds complexity and doesn't solve the immediate visibility problem
  3. Sampling: Show a random sample of errors - rejected because the first N errors are often more informative than a random sample
  4. Using None for unlimited: Considered using Optional[int] with None meaning unlimited - rejected in favor of -1 as a sentinel value so the type stays a plain int (see the note after this list)
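
Concretely, the rejected alternative would have typed the option as Optional[int], while the chosen design keeps a plain int; a sketch of the two annotations, not pandera's code:

  from typing import Optional

  # Rejected: None as the sentinel forces Optional handling at every use site
  max_failure_cases_alt: Optional[int] = 100  # None = unlimited

  # Chosen: plain int with -1 as the unlimited sentinel
  max_failure_cases: int = 100  # -1 = unlimited, 0 = count only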

Additional context
#2095

This feature is particularly important for:

  • Data pipelines processing large datasets (millions of rows)
  • Initial data exploration where many validation rules might fail
  • Production environments where excessive logging can impact performance
  • Jupyter notebooks where huge outputs can make the notebook unresponsive

Example of current behavior on a 1M-row dataframe with 500K failures:

  SchemaError: Column 'amount' failed validator: greater_than(0) failure cases: -1.5, -2.3, -0.1, -5.2, ... [500,000 values printed]

With this feature (default limit=100):

  SchemaError: Column 'amount' failed validator: greater_than(0) failure cases: -1.5, -2.3, -0.1, -5.2, ... [96 more values] ... and 499,900 more failure cases (500,000 total)
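
A sketch of the truncation formatting shown above; format_failure_cases is a hypothetical helper name, not pandera's internal function:

  def format_failure_cases(failure_cases, limit=100):
      """Render failure cases under the proposed limit semantics (sketch only)."""
      total = len(failure_cases)
      if limit == 0:
          # Count-only mode: report the total without examples
          return f"{total:,} failure cases (examples suppressed)"
      if limit == -1 or total <= limit:
          # Unlimited, or everything fits under the limit
          return ", ".join(map(str, failure_cases))
      shown = ", ".join(map(str, failure_cases[:limit]))
      return (
          f"{shown} ... and {total - limit:,} more failure cases "
          f"({total:,} total)"
      )

  # 500,000 failures truncated at the default limit of 100:
  cases = [-1.5, -2.3, -0.1, -5.2] + [-1.0] * 499_996
  print(format_failure_cases(cases, limit=100))
  # prints the first 100 values followed by
  # "... and 499,900 more failure cases (500,000 total)"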
