Simplify default cut labels #422
base: nl/quantilecut
`Statistics.quantile` returns values which are not the most appropriate for generating labels. It is more intuitive to choose values from the actual data, which are likely to have fewer decimals and make more sense to users. Unfortunately, since we use intervals closed on the left, we cannot use any of the seven standard definitions of quantiles. Type 1 is the closest, but we have to take the value next to it as the cutpoint to prevent it from being included in the next quantile group. This gives group assignments essentially consistent with R's `Hmisc::cut2` or `cut(x, quantile(x, (0:n)/n, type=1), include.lowest=T)`, though with different cutpoints in labels.
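The cutpoint rule described above could be sketched roughly as follows. This is only an illustration under assumptions, not the code in this PR: `data_cutpoints` is a hypothetical name, and the type 1 index is computed naively as `ceil(n * p)` without the floating-point fuzz that R applies.

```julia
# Hypothetical helper for illustration; not the package's implementation.
function data_cutpoints(x::AbstractVector{<:Real}, ngroups::Integer)
    sorted = sort(x)
    n = length(sorted)
    cuts = [first(sorted)]
    for k in 1:(ngroups - 1)
        # Type 1 quantile: smallest observation whose empirical CDF is >= k/ngroups.
        q = sorted[clamp(ceil(Int, n * k / ngroups), 1, n)]
        # Use the first observation strictly greater than q as the cutpoint,
        # so that q stays in the lower group under left-closed intervals [a, b).
        i = searchsortedlast(sorted, q) + 1
        i <= n && push!(cuts, sorted[i])
    end
    push!(cuts, last(sorted))
    return unique(cuts)
end

data_cutpoints(collect(1:10), 2)  # [1, 6, 10]
```

With `1:10` and two groups, the type 1 median is 5, so the cutpoint becomes 6 and the groups are `[1, 6)` and `[6, 10]`, each holding five observations.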
1) The quantile number isn't needed in most cases in the label, and anyway it's shown when printing an ordered `CategoricalValue`. Only use it by default when `allowempty=true`, to avoid data-dependent errors if there are duplicate levels.
2) Round breaks by default to a number of significant digits chosen by `sigdigits`. This number is increased if necessary for breaks to remain unique. The resulting labels are not completely correct, as rounding may make the left break greater than a value which is included in the interval, but this is generally minor and expected. Taking the floor rather than rounding would be more correct, but it can generate unexpected labels due to floating point trickiness (e.g. `floor(0.0003, sigdigits=4)` gives 0.0002999). This is what R does.
3) Add a deprecation to avoid breaking custom `labels` functions which did not accept `sigdigits`.
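The "increase `sigdigits` until breaks are unique" idea could look roughly like this. It is a minimal sketch under assumptions: `round_unique` and its `maxdigits` bound are hypothetical and not part of the package.

```julia
# Hypothetical helper; `maxdigits` is an assumed safety bound, not a package option.
function round_unique(breaks::AbstractVector{<:AbstractFloat};
                      sigdigits::Integer=3, maxdigits::Integer=17)
    for d in sigdigits:maxdigits
        rounded = round.(breaks; sigdigits=d)
        # Stop at the first precision that keeps all breaks distinct.
        allunique(rounded) && return rounded, d
    end
    # Breaks that are duplicates to begin with can never be made unique.
    error("breaks are not unique")
end

rounded, d = round_unique([1.234, 1.236, 2.5]; sigdigits=2)
# d == 3 here: two digits would merge the first two breaks into 1.2.

# Rounding can push the left break above a value in the interval:
round(0.12996; sigdigits=3) > 0.12996  # true
```

The last line illustrates the caveat mentioned above: a label starting at 0.13 would cover an interval that still contains 0.12996.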
I think this looks good but I can see that it relies on functionality that isn't supported in 1.6. With a new LTS, I'm fine with increasing the minimum version but maybe it is easy to adjust the implementation to avoid the newer functionality. It also looks like there is still a segfault on Windows.
Yeah, I think I can easily use a slightly less efficient approach on Julia 1.6 (generating a format string on each call). The segfault on 32-bit Windows is always the same one related to Plots. I'll have to report it to them. While you're at it, could you have a look at the latest version of #416? :-)
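One possible reading of "generating a format string on each call" is sketched below. It assumes labels are produced with C-style `%g` formatting, and `format_break` is a hypothetical name. `Printf.Format` and `Printf.format` are available from Julia 1.6 onwards.

```julia
using Printf

# Hypothetical helper: build the format at run time, since `@sprintf`
# needs a literal format string.
format_break(x::Real, sigdigits::Integer) =
    Printf.format(Printf.Format("%.$(sigdigits)g"), x)

format_break(0.0003, 3)   # "0.0003"
format_break(12.3456, 3)  # "12.3"
```

Constructing a `Printf.Format` object on every call is what makes this less efficient than a format fixed at compile time.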
e5d84c7 to a33a853
Fixes #381. This is on top of #416 for simplicity but essentially orthogonal.
Some points worth discussing:
- […] `@sprintf`, which apparently matches the C standard. We could use 4 significant digits by default to fix this, but it would make the output more verbose than it needs to be in general. We could also call `@sprintf` with both `%f` and `%e` and choose the shortest, but it's a bit ad hoc. I couldn't find a "better" printing method in packages (e.g. Format.jl).
- […] `string`, so they never use scientific notation. This fixes the previous problem in most cases, but if you have very large values it could be annoying. Not sure.
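To make the `@sprintf` options concrete, here is a small sketch. It assumes labels are formatted with `%g` at 3 significant digits, which is an assumption on my side. `shortest_label` is a hypothetical helper, and its fixed precision of two digits is an arbitrary choice for illustration.

```julia
using Printf

# With 3 significant digits, `%g` switches to scientific notation at 1000.
@sprintf("%.3g", 1234.0)  # "1.23e+03"
@sprintf("%.4g", 1234.0)  # "1234"

# Hypothetical helper: format both ways and keep the shorter string.
function shortest_label(x::Real)
    f = @sprintf("%.2f", x)
    e = @sprintf("%.2e", x)
    return length(f) <= length(e) ? f : e
end

shortest_label(1234.5678)  # "1234.57"
shortest_label(1.0e10)     # "1.00e+10"
```

The "shortest of both" rule picks fixed notation for moderate values and scientific notation for very large ones, but it needs a precision chosen up front, which is part of why it feels ad hoc.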