Simplify default cut labels #422

Open: wants to merge 12 commits into base nl/quantilecut


nalimilan
Member

  1. The quantile number isn't needed in most cases in the label, and anyway it's shown when printing an ordered CategoricalValue. Only use it by default when allowempty=true to avoid data-dependent errors if there are duplicate levels.

  2. Round breaks by default to a number of significant digits chosen by sigdigits. This number is increased if necessary for breaks to remain unique. The resulting labels are not always exact, since rounding may make the left break greater than a value which is included in the interval, but this is generally minor and expected. Taking the floor rather than rounding would be more correct, but it can generate unexpected labels due to floating point trickiness (e.g. floor(0.0003, sigdigits=4) gives 0.0002999). Rounding is also what R does.
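
To make point 2 concrete, here is a minimal sketch of rounding breaks to a number of significant digits and adding digits until the rounded breaks are unique. The helper name `round_breaks` and the cap of 17 digits are assumptions made for illustration, not the code in this PR.

```julia
# Hypothetical helper: round breaks to `sigdigits` significant digits,
# adding digits until the rounded breaks are all distinct.
function round_breaks(breaks::AbstractVector{<:AbstractFloat}; sigdigits::Integer=3)
    allunique(breaks) || throw(ArgumentError("breaks must be unique"))
    nsig = sigdigits
    rounded = round.(breaks, sigdigits=nsig)
    # Assumed cap of 17 digits so the loop always terminates for Float64.
    while !allunique(rounded) && nsig < 17
        nsig += 1
        rounded = round.(breaks, sigdigits=nsig)
    end
    return rounded, nsig
end

breaks = [0.12341, 0.12349, 0.5]
rounded, nsig = round_breaks(breaks; sigdigits=3)
```

With three significant digits the first two breaks would both round to 0.123, so the sketch moves to four digits. The second break, 0.12349, then rounds up to 0.1235, which is the inexactness described above: the value 0.12349 belongs to an interval whose label starts at 0.1235.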

Add a deprecation to avoid breaking custom labels functions which did not accept sigdigits.

Fixes #381. This is on top of #416 for simplicity but essentially orthogonal.


Some points worth discussing:

  • I'm not super satisfied with the fact that 1000.0 is rendered as 1e+03 rather than 1000 by default (contrary to 100). This is the behavior of @sprintf, which apparently matches the C standard. We could use 4 significant digits by default to fix this, but it would make the output more verbose than it needs to be in general. We could also call @sprintf with both %f and %e and choose the shortest but it's a bit ad-hoc. I couldn't find a "better" printing method in packages (e.g. Format.jl).
  • Integers are always rendered using string, so they never use scientific notation. This fixes the previous problem in most cases, but if you have very large values it could be annoying. Not sure.
  • I have experimented with additional code (removed by second commit so that you can see it) which checks whether rounding would make the left break of an interval become lower than the highest value in the previous interval, which means that the generated label doesn't exactly reflect its contents. In that case increasing the number of significant digits gives a mathematically correct result. But in the end I considered this was too complex for the gain: 1) it only works for quantiles, as for the general method it would require additional work to find the highest value of the previous interval (though it's certainly doable and relatively cheap) -- and the inconsistency isn't great; 2) it's probably too clever for users, in the sense that the number of digits would vary depending on the data without any visible reason when looking at the breaks; 3) anyway people are probably not too attached to exact values (one indication of this is that R rounds and I never saw any mention of this).
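
The formatting behavior discussed in the first two points can be reproduced with the Printf standard library. This is a small sketch assuming the labels use a `%g`-style format with three significant digits; the exact format string used in the PR is not shown here, and `format_break` is a hypothetical name.

```julia
using Printf

# %g switches to scientific notation once the decimal exponent
# reaches the requested number of significant digits.
s100    = @sprintf("%.3g", 100.0)   # three digits are enough for 100
s1000   = @sprintf("%.3g", 1000.0)  # exponent 3 >= precision 3: scientific
s1000_4 = @sprintf("%.4g", 1000.0)  # four digits avoid scientific notation here

# Hypothetical formatter: integers go through `string`,
# so they never use scientific notation.
format_break(x::Integer) = string(x)
format_break(x::AbstractFloat) = @sprintf("%.3g", x)
```

Here `s1000` is "1e+03" while `s100` is "100" and `s1000_4` is "1000", which is the trade-off between three and four default significant digits mentioned above.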

`Statistics.quantile` returns values which are not the most appropriate
to generate labels. It is more intuitive to choose values from the actual data,
which are likely to have fewer decimals and make more sense for users.

Unfortunately, since we use intervals closed on the left, we cannot use
any of the seven standard definitions of quantiles. Type 1 is the closest,
but we have to take the value next to it as a cutpoint to prevent it from
being included into the next quantile group. This gives essentially consistent
group attributions to R's `Hmisc::cut2` or
`cut(x, quantile(x, (0:n)/n, type=1), include.lowest=T)`,
though with different cutpoints in labels.
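
As a rough illustration of the idea in this commit message, the sketch below computes a type 1 quantile (inverse of the empirical CDF) and then takes the next larger data value as the cutpoint, so that with left-closed intervals the quantile value itself stays in the lower group. Both function names are hypothetical, and "the value next to it" is interpreted here as the next distinct value in the sorted data; the actual implementation in #416 may differ.

```julia
# Type 1 quantile: the smallest data value v such that at least
# a fraction p of the observations are <= v.
function quantile_type1(sorted::AbstractVector, p::Real)
    n = length(sorted)
    return sorted[clamp(ceil(Int, n * p), 1, n)]
end

# Hypothetical: with [a, b) intervals, use the next distinct data value
# above the type 1 quantile as the break. Returns `nothing` when the
# quantile is already the maximum.
function next_value_cutpoint(sorted::AbstractVector, p::Real)
    q = quantile_type1(sorted, p)
    i = searchsortedlast(sorted, q) + 1
    return i <= length(sorted) ? sorted[i] : nothing
end

x = sort([3, 1, 4, 1, 5, 9, 2, 6])   # [1, 1, 2, 3, 4, 5, 6, 9]
q = quantile_type1(x, 0.5)
cutpoint = next_value_cutpoint(x, 0.5)
```

For this data the type 1 median is 3 and the cutpoint is 4, so `[1, 4)` and `[4, 9]` each hold four values, and the cutpoint is a value taken from the data rather than an interpolated one.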
@andreasnoack
Member

I think this looks good but I can see that it relies on functionality that isn't supported in 1.6. With a new LTS, I'm fine with increasing the minimum version but maybe it is easy to adjust the implementation to avoid the newer functionality. It also looks like there is still a segfault on Windows.

@nalimilan
Member Author

Yeah, I think I can easily use a slightly less efficient approach on Julia 1.6 (generating a format string on each call). The segfault on 32-bit Windows is always the same one related to Plots. I'll have to report it to them.

While you're at it, if you can have a look at the latest version of #416. :-)

3 participants