Simplify default `cut` labels #422

nalimilan · 2025-05-17T09:42:31Z

The quantile number isn't needed in the label in most cases, and it is shown anyway when printing an ordered `CategoricalValue`. Only use it by default when `allowempty=true`, to avoid data-dependent errors if there are duplicate levels.
Round breaks by default to a number of significant digits chosen by `sigdigits`. This number is increased if necessary for breaks to remain unique. The resulting labels are not completely correct, since rounding may make the left break greater than a value which is included in the interval, but this is generally minor and expected. Rounding is also what R does. Taking the floor rather than rounding would be more correct, but it can generate unexpected labels due to floating point trickiness (e.g. `floor(0.0003, sigdigits=4)` gives 0.0002999).
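
As a rough illustration of the rounding rule described above (not the code from this PR), the sketch below rounds a vector of breaks to `sigdigits` significant digits and adds digits until the rounded breaks are distinct. The name `round_breaks` and its return value are made up for this example.

```julia
# Hypothetical helper, not the PR's implementation: round breaks to `sigdigits`
# significant digits, adding digits until the rounded breaks are all distinct.
function round_breaks(breaks::AbstractVector{<:Real}; sigdigits::Integer=3)
    # Without this check, duplicated input breaks would make the loop endless.
    allunique(breaks) || throw(ArgumentError("breaks must be unique"))
    nd = sigdigits
    while true
        rounded = round.(float.(breaks); sigdigits=nd)
        allunique(rounded) && return rounded, nd
        nd += 1
    end
end

# 1.001 and 1.002 both round to 1.0 with 3 significant digits,
# so a fourth digit is needed to keep the breaks distinct.
rounded, used = round_breaks([1.001, 1.002, 2.0]; sigdigits=3)
println(rounded, " using ", used, " significant digits")
```

One visible consequence of this design is that the number of digits in the labels depends on how close the breaks are to each other, not only on the `sigdigits` argument.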

Add a deprecation to avoid breaking custom `labels` functions which did not accept `sigdigits`.

Fixes #381. This is on top of #416 for simplicity, but it is essentially orthogonal.

Some points worth discussing:

I'm not super satisfied with the fact that 1000.0 is rendered as 1e+03 rather than 1000 by default (contrary to 100). This is the behavior of `@sprintf`, which apparently matches the C standard. We could use 4 significant digits by default to fix this, but it would make the output more verbose than it needs to be in general. We could also call `@sprintf` with both `%f` and `%e` and choose the shorter result, but it's a bit ad hoc. I couldn't find a "better" printing method in packages (e.g. Format.jl).
Integers are always rendered using `string`, so they never use scientific notation. This fixes the previous problem in most cases, but it could be annoying if you have very large values. Not sure.
I have experimented with additional code (removed by the second commit so that you can see it) which checks whether rounding would make the left break of an interval become lower than the highest value in the previous interval, which means that the generated label doesn't exactly reflect its contents. In that case, increasing the number of significant digits gives a mathematically correct result. But in the end I considered this too complex for the gain: 1) it only works for quantiles, as for the general method it would require additional work to find the highest value of the previous interval (though it's certainly doable and relatively cheap), and the inconsistency isn't great; 2) it's probably too clever for users, in the sense that the number of digits would vary depending on the data without any visible reason when looking at the breaks; 3) anyway, people are probably not too attached to exact values (one indication of this is that R rounds and I never saw any mention of this).
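
For reference, the `%g` behavior discussed in the first point can be reproduced with a few calls. This is only a sketch: it assumes that 3 significant digits (`%.3g`) is the format in question, and `fmt3` and `fmt4` are hypothetical helper names, not functions from the package.

```julia
using Printf

# Hypothetical helpers for illustration only.
# `%g` switches to scientific notation once the decimal exponent
# reaches the requested number of significant digits.
fmt3(x) = @sprintf("%.3g", x)
fmt4(x) = @sprintf("%.4g", x)

println(fmt3(100.0))    # 100
println(fmt3(1000.0))   # 1e+03
println(fmt4(1000.0))   # 1000

# Integers go through `string`, so they never use scientific notation.
println(string(1000))   # 1000
```

With 4 significant digits the 1000.0 case prints as expected, at the cost of one more digit in every non-integer label.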

`Statistics.quantile` returns values which are not the most appropriate to generate labels. It is more intuitive to choose values from the actual data, which are likely to have fewer decimals and make more sense for users. Unfortunately, since we use intervals closed on the left, we cannot use any of the seven standard definitions of quantiles. Type 1 is the closest, but we have to take the value next to it as a cutpoint to prevent it from being included in the next quantile group. This gives group assignments essentially consistent with R's `Hmisc::cut2` or `cut(x, quantile(x, (0:n)/n, type=1), include.lowest=T)`, though with different cutpoints in labels.
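
To make the idea concrete, here is a minimal sketch of one way to pick cutpoints from the data as described above. It is an assumption about the general approach, not the implementation in this PR or in #416: `data_cutpoints` is a hypothetical name, the type 1 quantile is taken as the `ceil(n*p)`-th order statistic, and the cutpoint is the next larger observed value.

```julia
# Hypothetical sketch, not the package's code: choose cutpoints among observed
# values so that left-closed intervals reproduce type 1 quantile groups.
function data_cutpoints(x::AbstractVector{<:Real}, ngroups::Integer)
    sorted = sort(x)
    n = length(sorted)
    breaks = [first(sorted)]
    for k in 1:(ngroups - 1)
        # Type 1 quantile for p = k/ngroups: the ceil(n*p)-th order statistic.
        # `cld` keeps the index computation in exact integer arithmetic.
        q = sorted[cld(n * k, ngroups)]
        # Use the next larger observed value as the cutpoint, so that q itself
        # stays in the lower group when intervals are closed on the left.
        i = searchsortedlast(sorted, q) + 1
        i <= n && push!(breaks, sorted[i])
    end
    push!(breaks, last(sorted))
    return unique(breaks)
end

x = [7, 3, 10, 1, 5, 9, 2, 8, 4, 6]
println(data_cutpoints(x, 4))   # [1, 4, 6, 9, 10]
```

For this input the type 1 quartiles are 3, 5 and 8, so the groups are {1, 2, 3}, {4, 5}, {6, 7, 8} and {9, 10}, and every cutpoint is a value that actually appears in the data.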


andreasnoack · 2025-05-17T14:42:55Z

I think this looks good but I can see that it relies on functionality that isn't supported in 1.6. With a new LTS, I'm fine with increasing the minimum version but maybe it is easy to adjust the implementation to avoid the newer functionality. It also looks like there is still a segfault on Windows.

nalimilan · 2025-05-17T15:13:01Z

Yeah, I think I can easily use a slightly less efficient approach on Julia 1.6 (generating a format string on each call). The segfault on 32-bit Windows is always the same one related to Plots. I'll have to report it to them.

While you're at it, if you can have a look at the latest version of #416. :-)

nalimilan added 9 commits April 16, 2025 18:54

Improve choice of cutpoints

bbed3a2

WIP

9dd935b

Yet another approach

cac54b2

Small cleanup

364651f

Fix doctests

bf0b310

Indentation

daaa0cc

Simplify logic

e1acb38

nalimilan mentioned this pull request May 17, 2025

Compact printing in cut #381

Open

nalimilan added 3 commits May 17, 2025 12:23

Merge remote-tracking branch 'origin/master' into nl/quantilecut

e5d84c7

Merge branch 'nl/quantilecut' into nl/cutlabels

6ef6662

Fix test

311e593

nalimilan requested review from andreasnoack and bkamins May 17, 2025 11:57

nalimilan mentioned this pull request May 17, 2025

Support weighted quantiles in cut #423

Open

andreasnoack approved these changes May 17, 2025


bkamins approved these changes May 18, 2025


andreasnoack force-pushed the nl/quantilecut branch from e5d84c7 to a33a853 May 18, 2025 19:35
