[FEATURE]: Improve summary stats report for string datatype columns #670

@STEFANOVIVAS

Description

Is there an existing issue for this?

  • I have searched the existing issues

Problem statement

The "profile" method uses the Databricks built-in "summary()" function to generate the summary statistics report. This function returns a predefined set of metrics, some of which are not suitable for text columns, especially those meant for numeric types (mean, stddev, etc.). For some of these metrics the summary stats already return null values for string columns, but others, such as the minimum and maximum, still return results, as in the example below.

( 'work_order_id': {'count': 6,
'mean': None,
'stddev': None,
'min': 'INVALID',
'25%': None,
'50%': None,
'75%': None,
'max': 'WO-005',
'count_non_null': 6,
'count_null': 0})

Proposed Solution

One approach is to return null values for "min" and "max" as well, since that seems a more consistent output for this datatype. Another approach would be to add a set of metrics specific to low-cardinality string columns, such as "count_distinct", "max_length", "mean_length", "median_length", "min_length", "histogram", etc.
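To make the two options concrete, here is a minimal sketch in plain Python, working on column values that have already been collected into a list. The helper names `null_out_numeric_metrics` and `string_column_metrics` are hypothetical and not part of the existing profiler, and the sample values (apart from 'INVALID' and 'WO-005') are made up for illustration.

```python
from collections import Counter
from statistics import mean, median

# Metrics that only make sense for numeric columns (assumption for this sketch).
NUMERIC_ONLY_METRICS = ("mean", "stddev", "min", "25%", "50%", "75%", "max")


def null_out_numeric_metrics(stats):
    """Option 1: return a copy of `stats` with numeric-style metrics set to None."""
    return {k: (None if k in NUMERIC_ONLY_METRICS else v) for k, v in stats.items()}


def string_column_metrics(values):
    """Option 2: length- and cardinality-based metrics for a string column."""
    non_null = [v for v in values if v is not None]
    lengths = [len(v) for v in non_null]
    return {
        "count_non_null": len(non_null),
        "count_null": len(values) - len(non_null),
        "count_distinct": len(set(non_null)),
        "min_length": min(lengths) if lengths else None,
        "max_length": max(lengths) if lengths else None,
        "mean_length": mean(lengths) if lengths else None,
        "median_length": median(lengths) if lengths else None,
        # Value frequencies; only sensible for low-cardinality columns.
        "histogram": dict(Counter(non_null)),
    }


# Illustrative input, not real data from the issue.
sample = ["WO-001", "WO-002", "WO-002", "INVALID", "WO-005", None]
string_metrics = string_column_metrics(sample)

# Option 1 applied to a stats dict shaped like the report above.
reported = {"count": 6, "mean": None, "min": "INVALID", "max": "WO-005", "count_null": 0}
cleaned = null_out_numeric_metrics(reported)
```

In a real profiler these metrics would more likely be computed with Spark aggregations rather than on a collected list; the sketch only shows what the output could contain.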

Additional Context

It would be useful to track changes in the data source profile over time, and this data is available only in the summary_stats variable. I think the metrics reported for each column could be datatype-driven.
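As a rough illustration of both ideas, the sketch below picks metrics by datatype and compares two profile snapshots. Everything here is an assumption made for the example: the `METRICS_BY_DTYPE` mapping, the datatype names, and the functions `select_metrics` and `diff_profiles` are hypothetical, and the snapshots are assumed to be dicts shaped like the summary stats shown earlier.

```python
# Hypothetical mapping from a datatype name to the metrics reported for it.
NUMERIC_METRICS = ["count", "mean", "stddev", "min", "25%", "50%", "75%", "max"]
METRICS_BY_DTYPE = {
    "string": ["count", "count_non_null", "count_null", "count_distinct",
               "min_length", "max_length"],
    "int": NUMERIC_METRICS,
    "double": NUMERIC_METRICS,
}
DEFAULT_METRICS = ["count", "count_non_null", "count_null"]


def select_metrics(stats, dtype):
    """Keep only the metrics configured for `dtype` that are present in `stats`."""
    wanted = METRICS_BY_DTYPE.get(dtype, DEFAULT_METRICS)
    return {name: stats[name] for name in wanted if name in stats}


def diff_profiles(previous, current):
    """Return {column: {metric: (old, new)}} for every metric whose value changed."""
    changes = {}
    for column in sorted(set(previous) | set(current)):
        old = previous.get(column, {})
        new = current.get(column, {})
        changed = {
            metric: (old.get(metric), new.get(metric))
            for metric in sorted(set(old) | set(new))
            if old.get(metric) != new.get(metric)
        }
        if changed:
            changes[column] = changed
    return changes


# Two illustrative snapshots of the same column taken at different times.
snapshot_a = {"work_order_id": {"count": 6, "min": "INVALID", "count_null": 0}}
snapshot_b = {"work_order_id": {"count": 7, "min": "INVALID", "count_null": 0}}

selected = select_metrics(snapshot_a["work_order_id"], "string")
drift = diff_profiles(snapshot_a, snapshot_b)
```

With a string column, `select_metrics` drops "min" because it is not in the string metric list, and `diff_profiles` reports only the metrics that differ between the two snapshots.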

    Labels

    enhancement (New feature or request)
