Skip to content

Choose different quantile cutpoints in cut(x, n) #416

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

nalimilan
Copy link
Member

@nalimilan nalimilan commented Apr 16, 2025

Statistics.quantile returns values which are not the most appropriate to generate labels. It is more intuitive to choose values from the actual data, which are likely to have fewer decimals and make more sense for users.

Unfortunately, since we use intervals closed on the left, we cannot use any of the seven standard definitions of quantiles. Type 1 is the closest, but we have to take the value next to it as a cutpoint to prevent it from being included into the next quantile group. This gives essentially consistent group attributions to R's Hmisc::cut2 or
cut(x, quantile(x, (0:n)/n, type=1), include.lowest=T)), though with different cutpoints in labels.

@andreasnoack @bkamins So in the end it seems I can't reuse any of JuliaStats/Statistics.jl#187. I'll finish that PR nethertheless. I'll also make other PRs against CategoricalArrays to simplify labels.

There are still a few weird cases where R gives slightly different results. This seems to happen when the quantile falls on a series of duplicated values... Hmisc's code is quite hard to follow so I'm not sure what's happening. An example of a failure is this: we put 681 in quintile 4 instead of 3.

julia> using CategoricalArrays, RCall

julia> x = [793, 465, 359, 681, 750, 701, 332, 222, 681, 262, 952, 907, 679, 198, 733, 438, 167, 59, 429, 686];

julia> n = 5;

julia> x1 = cut(x, n);

julia> x2 = rcopy(R"cut($x, quantile($x, (0:$n)/$n, type=1), include.lowest=T)");

julia> x3 = rcopy(R"Hmisc::cut2($x, g=$n)");

julia> (x1[4], levelcode(x1[4]))
(CategoricalValue{String, UInt32} "Q4: [681, 750)" (4/5), 4)

julia> (x2[4], levelcode(x2[4]))
(CategoricalValue{String, UInt32} "(429,681]", 3)

julia> (x3[4], levelcode(x3[4]))
(CategoricalValue{String, UInt32} "[438,686)", 3)

julia> sort(x)[Int(3*20/n)]
681

julia> sort(x)[Int(3*20/n) + 1]
681

julia> sort(x)[Int(3*20/n) + 2]
686

EDIT: maybe that's fine and there's no clearly superior solution when the quantile falls in the middle of a series of duplicate values?

Copy link
Member

@bkamins bkamins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that CI fails.

Copy link
Member

@andreasnoack andreasnoack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

Unrelated to the changes here but I think it would be useful to test with pre instead of nightly for CI. Also, do you know if the Plots related segfault on Windows has been reported elsewhere?

@nalimilan
Copy link
Member Author

Actually this way of choosing cutpoints isn't ideal in some corner cases such as this one:

julia> cut([1, 1, 1, 1, 2], 2)
ERROR: ArgumentError: cannot compute 2 quantiles due to too many duplicated values in `x`. Pass `allowempty=true` to allow empty quantiles or choose a lower value for `ngroups`.

julia> levels(cut([1, 1, 1, 1, 2], 2, allowempty=true))
2-element Vector{String}:
 "Q1: (1, 1)"
 "Q2: [1, 2]"

See also pola-rs/polars#10468 (comment). Hmisc::cut2 has the reverse problem and returns two groups for [1, 1, 1, 1, 2] but a single group for [0, 1, 1, 1, 1].

In general for left-closed intervals it seems better to take the next different value above the quantile point. This gives:

julia> levels(cut([1, 1, 1, 1, 2], 2))
2-element Vector{String}:
 "Q1: [1, 2)"
 "Q2: [2, 2]"

julia> levels(cut([0, 1, 1, 1, 1], 2))
2-element Vector{String}:
 "Q1: [0, 1)"
 "Q2: [1, 1]"

I've tested it and it gives identical groupings to R's cut(x, quantile(x, (0:n)/n, type=1), include.lowest=T) (at least in the absence of duplicate breaks as R errors in that case). So that's an appealing solution, but I'd like to make sure it doesn't create problems in other places...

@nalimilan
Copy link
Member Author

I've pushed a new commit implementing the new approach. You can look at the diff of (doc)tests to see cases where it gives more useful results.

For reference, I used this kind of code to check the equivalence with R:

using Test, RCall, CategoricalArrays
for _ in 1:100, x in (rand(1:100, 10), rand(1:1000, 100)), n in 1:length(x)
    local x2
    try
        #x3 = rcopy(R"as.integer(Hmisc::cut2($x, g=$n))")
        x2 = rcopy(R"as.integer(cut($x, quantile($x, (0:$n)/$n, type=1), include.lowest=T))")
    catch
        continue
    end
    x1 = levelcode.(cut(x, n))
    #@test x2 == x3
    @test x1 == x2
end

@bkamins
Copy link
Member

bkamins commented Apr 20, 2025

Thank you for working on this. I understand that for 1.0 release we consider this as acceptable change.

@nalimilan
Copy link
Member Author

I've pushed a commit implementing yet another approach which I find better in the end. It returns exactly the same group assignments as current master, the only difference is that breaks are taken from actual values in the data, which are generally nicer and in a more appropriate type (e.g. no float for integer inputs). I see several advantages to this:

  • It's less breaking for users as it only changes labels.
  • It's easy to replicate using quantile (group assignments at least, breaks are hard to recompute) and it uses the default quantile type.
  • It matches what other implementations do, notably cut(x, quantile(x, (0:n)/n), right=FALSE) in R (which is used by several R packages that implement convenience functions similar to cut(x, n)) and cut(x, n, left_closed=true) in Polars (though the latter seems to be buggy in some cases).
  • It will be easy to extend in a consistent way if at some point we support right_closed=true.

The drawbacks is that some examples I showed above do not give optimal results (cut([1, 1, 1, 1, 2], 2) gives an empty group and all values in the other one), but this isn't a big deal in practice. The result also differs from Hmisc::cut2, but there's no reason to give more importance to this (underdocumented) implementation than others.

The code I used to check equivalence with R and Polars (though the latter gives a lot of problematic results):

using Test, RCall, CategoricalArrays
for _ in 1:100, x in (rand(1:10, 10), rand(1:100, 100)), n in 1:length(x)
    @show x, n
    local x2, x4
    try
        x2 = rcopy(R"cut($x, quantile($x, (0:$n)/$n, type=7), right=F, include.lowest=T)")
        x4 = rcopy(R"polars::pl$Series(\"x\", $x)$qcut($n, left_closed=T) |> as.vector()")
        droplevels!(x2)
        length(levels(x2)) < n && throw(ErrorException("empty group!"))
    catch
        continue
    end
    x1 = cut(x, n)
    @test levelcode.(x1) == levelcode.(x2)
    #@test levelcode.(x1) == levelcode.(x4)
end

nalimilan added 7 commits May 18, 2025 21:35
`Statistics.quantile` returns values which are not the most appropriate
to generate labels. It is more intuitive to choose values from the actual data,
which are likely to have fewer decimals and make more sense for users.

Unfortunately, since we use intervals closed on the left, we cannot use
any of the seven standard definitions of quantiles. Type 1 is the closest,
but we have to take the value next to it as a cutpoint to prevent it from
being included into the next quantile group. This gives essentially consistent
group attributions to R's `Hmisc::cut2` or
`cut(x, quantile(x, (0:n)/n, type=1), include.lowest=T))`,
though with different cutpoints in labels.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants