Codecov Report
@@ Coverage Diff @@
## master #360 +/- ##
==========================================
- Coverage 84.46% 83.84% -0.63%
==========================================
Files 51 52 +1
Lines 3902 3961 +59
==========================================
+ Hits 3296 3321 +25
- Misses 606 640 +34
from lux.vislib.altair.Histogram import Histogram
from lux.vislib.altair.Heatmap import Heatmap
from lux.vislib.altair.Choropleth import Choropleth
from lux.vislib.altair.BoxPlot import BoxPlot
If the user uses matplotlib for boxplots, could we render the boxplot in Altair and show an info button message letting users know that the matplotlib boxplot is not currently implemented? This is similar to what we did for the geographical maps in matplotlib.
I've implemented the Altair fallback as well as the message. However, since I can't set intent on the dataframe because of the matplotlib bug, I haven't been able to confirm that the message works. Let me know if you'd like me to remove it, since there is currently no way to verify it!
elif data_type_constraint == "nominal":
    possible_attributes = [
        c for c in ldf.columns if ldf.data_type[c] == "nominal" and c != "Number of Records"
        c
This line split is pretty weird and hard to read. Can we fix this and add a comment on what this list of possible_attributes is used for?
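For reference, here is one way the comprehension could be laid out so the filter reads top to bottom. This is only a sketch: `data_type` is a plain dict standing in for `ldf.data_type`, the column names are made up, and letting `ordinal` pass the nominal check is an assumption taken from the `univariate.py` bullet in the PR description, not from the actual diff.

```python
# Hypothetical stand-in for ldf.data_type: column name -> Lux data type.
data_type = {
    "education_level": "ordinal",
    "city": "nominal",
    "salary": "quantitative",
    "Number of Records": "nominal",  # included only to exercise the name filter
}
columns = list(data_type)

# Assumption: ordinal columns are accepted wherever nominal ones are.
NOMINAL_LIKE = {"nominal", "ordinal"}

# Columns eligible for the nominal branch, excluding the synthetic count column.
possible_attributes = [
    c
    for c in columns
    if data_type[c] in NOMINAL_LIKE and c != "Number of Records"
]
```

Splitting the comprehension into one clause per line keeps the condition on a single line, which is where the original split was hard to follow.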
Thanks @jinimukh!! Can we file a follow-up issue to delegate boxplot calculations to the Pandas and SQL Executor? This will help with performance by bringing the rendering cost down from that of a scatterplot to that of a boxplot (a few summary statistics plus outliers).
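To make the follow-up concrete, a rough sketch of what the pandas side could precompute per category is below. The function name `boxplot_summary`, its parameters, and the 1.5 × IQR (Tukey) outlier rule are assumptions for illustration; this is not existing Lux executor code.

```python
import pandas as pd


def boxplot_summary(df, dimension, measure, whisker=1.5):
    """Per-category quartiles, whisker ends, and outliers (Tukey rule)."""
    rows = []
    for category, values in df.groupby(dimension)[measure]:
        q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
        iqr = q3 - q1
        lower_fence = q1 - whisker * iqr
        upper_fence = q3 + whisker * iqr
        inside = (values >= lower_fence) & (values <= upper_fence)
        rows.append(
            {
                dimension: category,
                "q1": q1,
                "median": median,
                "q3": q3,
                # Whiskers end at the most extreme points inside the fences.
                "whisker_low": values[inside].min(),
                "whisker_high": values[inside].max(),
                # Only these points need to be sent to the renderer individually.
                "outliers": values[~inside].tolist(),
            }
        )
    return pd.DataFrame(rows)
```

With a summary like this, the renderer would receive one row per category plus the outlier points, rather than every raw row.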
I'm wondering if ordinal data types have to be a subset of nominal data? Apart from the documentation and within the actions logic (
Here are some examples that I was playing around with:
import pandas as pd
import lux

df = pd.read_csv("https://raw.githubusercontent.com/lux-org/lux-datasets/master/data/aug_test.csv")
df = df.dropna(subset=["education_level", "company_size"])
# Mark education_level as ordinal and give it an explicit order.
df.set_data_type({"education_level": "ordinal"},
                 order={"education_level": ["Primary School", "High School", "Masters", "Graduate", "Phd"]})
df["education_level"]
# Same for company_size, whose labels do not sort correctly as strings.
df.set_data_type({"company_size": "ordinal"},
                 order={"company_size": [
                     "<10", "10/49", "50-99", "100-500",
                     "500-999", "1000-4999", "5000-9999", "10000+"
                 ]})
df["company_size"]
I was initially a bit confused about why the boxplot was not shown for the number of records case in univariate (until we set the intent); then I realized that the boxplot didn't make sense for the ordinal data type. I wonder if it makes sense to have a bivariate ordinal data type tab, i.e., ordinal with respect to all measure values, so that the boxplot could be shown in the initial view.
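As a side note on why the explicit `order` lists in the example matter: the `company_size` labels do not sort correctly as plain strings. The sketch below uses a pandas ordered `Categorical` to show the difference. It only illustrates the idea of a custom ordering and makes no claim about how Lux stores the order internally; the sample values are made up.

```python
import pandas as pd

company_size_order = [
    "<10", "10/49", "50-99", "100-500",
    "500-999", "1000-4999", "5000-9999", "10000+",
]

# A few made-up values in arbitrary order.
sizes = pd.Series(["10000+", "<10", "50-99", "10/49"])

# Plain string sorting is lexicographic and ignores the intended order.
plain_sorted = sorted(sizes)

# An ordered categorical sorts by position in company_size_order instead.
ordered = pd.Series(pd.Categorical(sizes, categories=company_size_order, ordered=True))
sorted_sizes = ordered.sort_values().tolist()
```

Here `sorted_sizes` follows the declared order from smallest to largest company, while `plain_sorted` puts `"<10"` last.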
Overview
This PR addresses #240 by adding support for the ordinal data type. Currently, the only way to set a column's data type to ordinal is to call `df.set_data_type({"col_name": "ordinal"})`. Optionally, if the entries do not have a natural ordering such as numeric or alphabetical, a custom ordering can be specified with `df.set_data_type({"col_name": "ordinal"}, order={"col_name": [ordered_lst]})`. To visualize ordinal data types we use boxplots, but because they are bivariate distributions, they only show up to enhance a selected visualization.
Changes
- `univariate.py`: allow `ordinal` data types to be treated as `nominal` data types to create bar graphs in the `Occurrences` tab
- `frame.py`: allow the `set_data_type` function to take an optional `order` argument to specify orders on ordinal data
- `BoxPlot.py`: currently only supports Altair BoxPlots
- `Compiler.py`: allow the mark to be `box` when `n_dim == 1 and n_msr == 1 and dimension_type == "ordinal"`
Example Output