Description
Is there an existing issue for this?
- I have searched the existing issues
Problem statement
The "profile" method uses the Databricks built-in "summary()" function to generate the summary statistics report. This function returns a fixed set of metrics, some of which are not suitable for text variables, especially those defined for numeric types (mean, stddev, etc.). For some of these metrics the summary stats already return null values for text columns, but others, such as the minimum and maximum, still return results, as in the example below.
( 'work_order_id': {'count': 6,
'mean': None,
'stddev': None,
'min': 'INVALID',
'25%': None,
'50%': None,
'75%': None,
'max': 'WO-005',
'count_non_null': 6,
'count_null': 0})
Proposed Solution
One approach is to return null values for the "min" and "max" metrics too, since that seems a more consistent output for this datatype. Another approach would be to generate a set of specific metrics for low-cardinality string datatypes, such as "count_distinct", "max_length", "mean_length", "median_length", "min_length", "histogram", etc.
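To make the second approach concrete, here is a rough sketch in plain Python of what string-specific metrics could look like. The function name `string_column_stats` and the sample values are hypothetical and only for illustration; this is not the existing profiler code, and a real implementation would presumably compute the same metrics with Spark aggregations instead of Python lists.

```python
from collections import Counter
from statistics import mean, median
from typing import Any, Dict, List, Optional


def string_column_stats(values: List[Optional[str]]) -> Dict[str, Any]:
    """Hypothetical string-specific metrics (assumption, not the current profiler)."""
    non_null = [v for v in values if v is not None]
    lengths = [len(v) for v in non_null]
    return {
        "count_non_null": len(non_null),
        "count_null": len(values) - len(non_null),
        "count_distinct": len(set(non_null)),
        # Length metrics are None when the column has no non-null values.
        "min_length": min(lengths) if lengths else None,
        "max_length": max(lengths) if lengths else None,
        "mean_length": mean(lengths) if lengths else None,
        "median_length": median(lengths) if lengths else None,
        # Value frequencies; only sensible for low-cardinality columns.
        "histogram": dict(Counter(non_null)),
    }


# Made-up sample values, loosely modelled on the column in the example above.
stats = string_column_stats(["WO-001", "WO-002", "WO-002", "INVALID", None])
```

With metrics like these, the lexicographic "min" and "max" could be dropped or set to null for string columns without losing useful information.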
Additional Context
It would be useful to track changes in the data source profile over time, and this data is available only in the summary_stats variable. I think the metrics computed for each variable could be datatype-driven.
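As an illustration of the tracking use case, the sketch below compares two summary_stats snapshots and reports which metrics changed. It assumes each snapshot is a dict of the shape shown in the example above (column name mapped to a dict of metrics); the helper name `diff_summary_stats` and the "today" values are made up for the example.

```python
from typing import Any, Dict, Tuple

Profile = Dict[str, Dict[str, Any]]


def diff_summary_stats(previous: Profile, current: Profile) -> Dict[str, Dict[str, Tuple[Any, Any]]]:
    """Return {column: {metric: (old, new)}} for every metric that changed."""
    changes: Dict[str, Dict[str, Tuple[Any, Any]]] = {}
    for column in sorted(set(previous) | set(current)):
        old = previous.get(column, {})
        new = current.get(column, {})
        changed = {
            metric: (old.get(metric), new.get(metric))
            for metric in sorted(set(old) | set(new))
            if old.get(metric) != new.get(metric)
        }
        if changed:
            changes[column] = changed
    return changes


# "yesterday" reuses values from the example above; "today" is invented sample data.
yesterday = {"work_order_id": {"count": 6, "min": "INVALID", "max": "WO-005", "count_null": 0}}
today = {"work_order_id": {"count": 8, "min": "INVALID", "max": "WO-007", "count_null": 1}}
changes = diff_summary_stats(yesterday, today)
```

A comparison like this is only meaningful if the metrics themselves make sense for the column's datatype, which is why datatype-driven metrics would help here.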