[FEATURE] Avoid Full Scan for Size() Metric Calculation #600

@lawofcycles

Is your feature request related to a problem? Please describe.
Currently, the Size() metric in Deequ is computed with a full scan of the DataFrame (a complete .count()) to determine the total number of rows. This is inefficient, particularly on large datasets. Moreover, many other analyzers already compute the row count as part of their aggregations (e.g., for metrics like UniqueValueRatio or Distinctness), so an independent Size() calculation is redundant and adds unnecessary processing.

Describe the solution you'd like
I propose that the Size() computation be integrated into the existing aggregation scans performed by other analyzers in the AnalysisRunner. Specifically, when executing scanning analyzers, the row count should be calculated once and reused for the Size() metric rather than triggering an extra full scan. Additionally, for grouping analyzers that derive the total number of rows via frequency aggregations, that information should be repurposed to set the Size() metric. This integration would avoid redundant data scans and improve the overall performance of the analysis process.
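
To make the idea concrete, here is a minimal sketch in plain Spark. It is not Deequ's actual implementation. The object and helper names (`SharedRowCountSketch`, `sizeAndCompleteness`, `sizeFromFrequencies`) and the column names are hypothetical, and the code assumes Spark SQL is on the classpath. The first helper gets the row count from the same aggregation pass as another metric; the second derives it from a frequency table that a grouping analyzer would already have built.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count, lit, sum}

// Illustrative only: all names here are hypothetical, not Deequ APIs.
object SharedRowCountSketch {

  // One aggregation pass yields both the row count and another metric
  // (here the fraction of non-null values in `column`).
  def sizeAndCompleteness(df: DataFrame, column: String): (Long, Double) = {
    val row = df.agg(
      count(lit(1)).as("num_rows"),     // counts every row
      count(col(column)).as("non_null") // counts non-null values only
    ).head()
    val numRows = row.getLong(0)
    val nonNull = row.getLong(1)
    val completeness =
      if (numRows == 0L) Double.NaN else nonNull.toDouble / numRows
    (numRows, completeness)
  }

  // The row count is recovered by summing the per-group counts of a
  // frequency table instead of scanning the raw data again.
  def sizeFromFrequencies(df: DataFrame, column: String): Long = {
    val frequencies = df.groupBy(col(column)).agg(count(lit(1)).as("cnt"))
    val total = frequencies.agg(sum(col("cnt"))).head()
    // sum over an empty frequency table is null, so treat that as zero rows
    if (total.isNullAt(0)) 0L else total.getLong(0)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("size-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq[(Int, String)]((1, "a"), (2, "b"), (3, "a"), (4, null))
      .toDF("id", "item")

    println(sizeAndCompleteness(df, "item"))
    println(sizeFromFrequencies(df, "item"))

    spark.stop()
  }
}
```

One caveat for the grouping case: the sum of frequencies equals the total row count only if the frequency computation keeps every row. If rows are filtered out before grouping (for example, rows with null grouping keys), the derived value undercounts and cannot be reported as Size() without adjustment.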

Describe alternatives you've considered
One alternative is to continue computing the Size() metric as a separate aggregation, but that inherently leads to an extra full scan, which is inefficient. Another option might be to allow users to disable the separate computation via configuration; however, that adds extra complexity and does not fully address the performance issue. Integrating the row count calculation into the existing scan logic seems to be the optimal solution, as it leverages already computed values and minimizes the overall processing overhead.

Additional context
There is a TODO comment in the AnalysisRunner.doAnalysisRun method that suggests obtaining the number of rows from other metrics, or adding the Size() computation to a scan when the data has to be scanned anyway. Implementing this enhancement would eliminate the redundant full scan for Size() and streamline metric computation, which should reduce run time on large datasets.

Labels: enhancement
