Choose different quantile cutpoints in `cut(x, n)` #416

nalimilan · 2025-04-16T17:10:28Z

Statistics.quantile returns values which are not the most appropriate to generate labels. It is more intuitive to choose values from the actual data, which are likely to have fewer decimals and make more sense for users.

Unfortunately, since we use intervals closed on the left, we cannot use any of the seven standard definitions of quantiles. Type 1 is the closest, but we have to take the value next to it as a cutpoint to prevent it from being included into the next quantile group. This gives essentially consistent group attributions to R's Hmisc::cut2 or
cut(x, quantile(x, (0:n)/n, type=1), include.lowest=T)), though with different cutpoints in labels.

@andreasnoack @bkamins So in the end it seems I can't reuse any of JuliaStats/Statistics.jl#187. I'll finish that PR nethertheless. I'll also make other PRs against CategoricalArrays to simplify labels.

There are still a few weird cases where R gives slightly different results. This seems to happen when the quantile falls on a series of duplicated values... Hmisc's code is quite hard to follow so I'm not sure what's happening. An example of a failure is this: we put 681 in quintile 4 instead of 3.

julia> using CategoricalArrays, RCall

julia> x = [793, 465, 359, 681, 750, 701, 332, 222, 681, 262, 952, 907, 679, 198, 733, 438, 167, 59, 429, 686];

julia> n = 5;

julia> x1 = cut(x, n);

julia> x2 = rcopy(R"cut($x, quantile($x, (0:$n)/$n, type=1), include.lowest=T)");

julia> x3 = rcopy(R"Hmisc::cut2($x, g=$n)");

julia> (x1[4], levelcode(x1[4]))
(CategoricalValue{String, UInt32} "Q4: [681, 750)" (4/5), 4)

julia> (x2[4], levelcode(x2[4]))
(CategoricalValue{String, UInt32} "(429,681]", 3)

julia> (x3[4], levelcode(x3[4]))
(CategoricalValue{String, UInt32} "[438,686)", 3)

julia> sort(x)[Int(3*20/n)]
681

julia> sort(x)[Int(3*20/n) + 1]
681

julia> sort(x)[Int(3*20/n) + 2]
686

EDIT: maybe that's fine and there's no clearly superior solution when the quantile falls in the middle of a series of duplicate values?

bkamins

Note that CI fails.

andreasnoack

Looks good to me.

Unrelated to the changes here but I think it would be useful to test with pre instead of nightly for CI. Also, do you know if the Plots related segfault on Windows has been reported elsewhere?

nalimilan · 2025-04-18T21:56:25Z

Actually this way of choosing cutpoints isn't ideal in some corner cases such as this one:

julia> cut([1, 1, 1, 1, 2], 2)
ERROR: ArgumentError: cannot compute 2 quantiles due to too many duplicated values in `x`. Pass `allowempty=true` to allow empty quantiles or choose a lower value for `ngroups`.

julia> levels(cut([1, 1, 1, 1, 2], 2, allowempty=true))
2-element Vector{String}:
 "Q1: (1, 1)"
 "Q2: [1, 2]"

See also pola-rs/polars#10468 (comment). Hmisc::cut2 has the reverse problem and returns two groups for [1, 1, 1, 1, 2] but a single group for [0, 1, 1, 1, 1].

In general for left-closed intervals it seems better to take the next different value above the quantile point. This gives:

julia> levels(cut([1, 1, 1, 1, 2], 2))
2-element Vector{String}:
 "Q1: [1, 2)"
 "Q2: [2, 2]"

julia> levels(cut([0, 1, 1, 1, 1], 2))
2-element Vector{String}:
 "Q1: [0, 1)"
 "Q2: [1, 1]"

I've tested it and it gives identical groupings to R's cut(x, quantile(x, (0:n)/n, type=1), include.lowest=T) (at least in the absence of duplicate breaks as R errors in that case). So that's an appealing solution, but I'd like to make sure it doesn't create problems in other places...

nalimilan · 2025-04-19T10:24:25Z

I've pushed a new commit implementing the new approach. You can look at the diff of (doc)tests to see cases where it gives more useful results.

For reference, I used this kind of code to check the equivalence with R:

using Test, RCall, CategoricalArrays
for _ in 1:100, x in (rand(1:100, 10), rand(1:1000, 100)), n in 1:length(x)
    local x2
    try
        #x3 = rcopy(R"as.integer(Hmisc::cut2($x, g=$n))")
        x2 = rcopy(R"as.integer(cut($x, quantile($x, (0:$n)/$n, type=1), include.lowest=T))")
    catch
        continue
    end
    x1 = levelcode.(cut(x, n))
    #@test x2 == x3
    @test x1 == x2
end

bkamins · 2025-04-20T16:51:10Z

Thank you for working on this. I understand that for 1.0 release we consider this as acceptable change.

nalimilan · 2025-04-27T21:30:04Z

I've pushed a commit implementing yet another approach which I find better in the end. It returns exactly the same group assignments as current master, the only difference is that breaks are taken from actual values in the data, which are generally nicer and in a more appropriate type (e.g. no float for integer inputs). I see several advantages to this:

It's less breaking for users as it only changes labels.
It's easy to replicate using quantile (group assignments at least, breaks are hard to recompute) and it uses the default quantile type.
It matches what other implementations do, notably cut(x, quantile(x, (0:n)/n), right=FALSE) in R (which is used by several R packages that implement convenience functions similar to cut(x, n)) and cut(x, n, left_closed=true) in Polars (though the latter seems to be buggy in some cases).
It will be easy to extend in a consistent way if at some point we support right_closed=true.

The drawbacks is that some examples I showed above do not give optimal results (cut([1, 1, 1, 1, 2], 2) gives an empty group and all values in the other one), but this isn't a big deal in practice. The result also differs from Hmisc::cut2, but there's no reason to give more importance to this (underdocumented) implementation than others.

The code I used to check equivalence with R and Polars (though the latter gives a lot of problematic results):

using Test, RCall, CategoricalArrays
for _ in 1:100, x in (rand(1:10, 10), rand(1:100, 100)), n in 1:length(x)
    @show x, n
    local x2, x4
    try
        x2 = rcopy(R"cut($x, quantile($x, (0:$n)/$n, type=7), right=F, include.lowest=T)")
        x4 = rcopy(R"polars::pl$Series(\"x\", $x)$qcut($n, left_closed=T) |> as.vector()")
        droplevels!(x2)
        length(levels(x2)) < n && throw(ErrorException("empty group!"))
    catch
        continue
    end
    x1 = cut(x, n)
    @test levelcode.(x1) == levelcode.(x2)
    #@test levelcode.(x1) == levelcode.(x4)
end

`Statistics.quantile` returns values which are not the most appropriate to generate labels. It is more intuitive to choose values from the actual data, which are likely to have fewer decimals and make more sense for users. Unfortunately, since we use intervals closed on the left, we cannot use any of the seven standard definitions of quantiles. Type 1 is the closest, but we have to take the value next to it as a cutpoint to prevent it from being included into the next quantile group. This gives essentially consistent group attributions to R's `Hmisc::cut2` or `cut(x, quantile(x, (0:n)/n, type=1), include.lowest=T))`, though with different cutpoints in labels.

nalimilan requested review from andreasnoack and bkamins April 16, 2025 17:10

bkamins approved these changes Apr 17, 2025

View reviewed changes

andreasnoack approved these changes Apr 17, 2025

View reviewed changes

nalimilan added the breaking label Apr 18, 2025

nalimilan mentioned this pull request May 17, 2025

Simplify default cut labels #422

Open

andreasnoack approved these changes May 17, 2025

View reviewed changes

nalimilan added 7 commits May 18, 2025 21:35

Improve choice of cutpoints

9787d61

WIP

2efe695

Yet another approach

6a913c7

Small cleanup

844ff42

Fix doctests

b54c9fb

Indentation

a33a853

andreasnoack force-pushed the nl/quantilecut branch from e5d84c7 to a33a853 Compare May 18, 2025 19:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Choose different quantile cutpoints in `cut(x, n)` #416

Choose different quantile cutpoints in `cut(x, n)` #416

nalimilan commented Apr 16, 2025 •

edited

Loading

bkamins left a comment

andreasnoack left a comment

nalimilan commented Apr 18, 2025

nalimilan commented Apr 19, 2025

bkamins commented Apr 20, 2025

nalimilan commented Apr 27, 2025

Choose different quantile cutpoints in cut(x, n) #416

Are you sure you want to change the base?

Choose different quantile cutpoints in cut(x, n) #416

Conversation

nalimilan commented Apr 16, 2025 • edited Loading

bkamins left a comment

Choose a reason for hiding this comment

andreasnoack left a comment

Choose a reason for hiding this comment

nalimilan commented Apr 18, 2025

nalimilan commented Apr 19, 2025

bkamins commented Apr 20, 2025

nalimilan commented Apr 27, 2025

Choose different quantile cutpoints in `cut(x, n)` #416

Choose different quantile cutpoints in `cut(x, n)` #416

nalimilan commented Apr 16, 2025 •

edited

Loading