[FEATURE] Avoid Full Scan for Size() Metric Calculation #600

**Is your feature request related to a problem? Please describe.**
Currently, the Size() metric in Deequ is calculated by performing a full scan (i.e., [executing a complete .count() on the DataFrame](https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/analyzers/Size.scala#L41C5-L41C21)) to determine the total number of rows. This approach is inefficient, particularly on large datasets. Moreover, many other analyzers already compute the row count as part of their aggregations (e.g., for calculating metrics like UniqueValueRatio or Distinctness), which makes an independent Size() calculation redundant and results in unnecessary additional processing.

**Describe the solution you'd like**
I propose that the Size() computation be integrated into the existing aggregation scans performed by other analyzers in the AnalysisRunner. Specifically, when executing scanning analyzers, the row count should be calculated once and reused for the Size() metric rather than triggering an extra full scan. Additionally, for grouping analyzers that derive the total number of rows via frequency aggregations, that information should be repurposed to set the Size() metric. This integration would avoid redundant data scans and improve the overall performance of the analysis process.
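As a rough illustration of the shared-scan idea (plain Spark, not Deequ's actual code), the sketch below computes the row count alongside another aggregate in a single `agg` call, so only one job runs over the data. The names `SharedCounts`, `SharedScanSketch`, and `countsInOnePass` are hypothetical and exist only for this example; it assumes Spark SQL is on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count, lit}

// Hypothetical container for this sketch; not a Deequ type.
final case class SharedCounts(size: Long, nonNull: Long)

object SharedScanSketch {

  // Computes the total row count and the non-null count of `column`
  // in ONE aggregation, i.e. a single pass over the data.
  def countsInOnePass(df: DataFrame, column: String): SharedCounts = {
    val row = df.agg(
      count(lit(1)).as("size"),          // counts every row, nulls included
      count(col(column)).as("non_null")  // counts only rows where `column` is not null
    ).head()
    SharedCounts(row.getLong(0), row.getLong(1))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("shared-scan-sketch")
      .getOrCreate()
    import spark.implicits._

    // Tiny illustrative dataset with one null key.
    val df = Seq[(Option[String], Int)](
      (Some("a"), 1), (Some("b"), 2), (Some("a"), 3), (None, 4)
    ).toDF("key", "value")

    val counts = countsInOnePass(df, "key")
    // The size is reused here to derive a completeness-style ratio
    // without triggering a second job.
    println(s"size=${counts.size}, completeness=${counts.nonNull.toDouble / counts.size}")

    spark.stop()
  }
}
```

In this sketch the size comes for free with the other aggregate; how this would map onto the scanning analyzers in the AnalysisRunner is left open.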

**Describe alternatives you've considered**
One alternative is to continue computing the Size() metric as a separate aggregation, but that inherently leads to an extra full scan, which is inefficient. Another option might be to allow users to disable the separate computation via configuration; however, that adds extra complexity and does not fully address the performance issue. Integrating the row count calculation into the existing scan logic seems to be the optimal solution, as it leverages already computed values and minimizes the overall processing overhead.

**Additional context**
There is a [TODO comment in the AnalysisRunner.doAnalysisRun method](https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/analyzers/runners/AnalysisRunner.scala#L171C1-L175C6) that suggests obtaining the number of rows from other metrics or combining the Size() computation with an extra scan if data is already being processed. Implementing this enhancement would eliminate the redundant full scan for Size() and streamline metric computation, leading to significant performance improvements on large datasets. 
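For the "number of rows from other metrics" part, here is a minimal sketch in plain Spark (again, not Deequ's implementation) of how a frequency table could yield the row count: summing the per-group counts gives the total number of rows that went into the grouping. `FrequencyCounts`, `FrequencySizeSketch`, and `fromFrequencies` are hypothetical names used only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, count, lit, sum}

// Hypothetical container for this sketch; not a Deequ type.
final case class FrequencyCounts(numRows: Long, numGroups: Long)

object FrequencySizeSketch {

  // Builds a frequency table for `column`, then derives the row count
  // from it instead of running a separate count over the data.
  def fromFrequencies(df: DataFrame, column: String): FrequencyCounts = {
    val frequencies = df.groupBy(col(column)).agg(count(lit(1)).as("cnt"))

    val row = frequencies.agg(
      // Sum of group sizes = rows that entered the grouping.
      // coalesce guards against the null that sum() yields on empty input.
      coalesce(sum(col("cnt")), lit(0L)).as("num_rows"),
      count(lit(1)).as("num_groups")
    ).head()

    FrequencyCounts(row.getLong(0), row.getLong(1))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("frequency-size-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq[(Option[String], Int)](
      (Some("a"), 1), (Some("b"), 2), (Some("a"), 3), (None, 4)
    ).toDF("key", "value")

    println(fromFrequencies(df, "key"))
    spark.stop()
  }
}
```

One caveat: Spark's `groupBy` keeps null keys as their own group, so the sum here equals the full row count. If the frequency computation filters out rows with null grouping columns before counting, the derived number would be smaller than Size() and could not be reused as-is.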
