Choose different quantile cutpoints in cut(x, n)
#416
Note that CI fails.
Looks good to me.
Unrelated to the changes here, but I think it would be useful to test with `pre` instead of `nightly` for CI. Also, do you know if the Plots-related segfault on Windows has been reported elsewhere?
Actually this way of choosing cutpoints isn't ideal in some corner cases such as this one:

```julia
julia> cut([1, 1, 1, 1, 2], 2)
ERROR: ArgumentError: cannot compute 2 quantiles due to too many duplicated values in `x`. Pass `allowempty=true` to allow empty quantiles or choose a lower value for `ngroups`.

julia> levels(cut([1, 1, 1, 1, 2], 2, allowempty=true))
2-element Vector{String}:
 "Q1: (1, 1)"
 "Q2: [1, 2]"
```

See also pola-rs/polars#10468 (comment). In general for left-closed intervals it seems better to take the next different value above the quantile point. This gives:

```julia
julia> levels(cut([1, 1, 1, 1, 2], 2))
2-element Vector{String}:
 "Q1: [1, 2)"
 "Q2: [2, 2]"

julia> levels(cut([0, 1, 1, 1, 1], 2))
2-element Vector{String}:
 "Q1: [0, 1)"
 "Q2: [1, 1]"
```

I've tested it and it gives identical groupings to R's
I've pushed a new commit implementing the new approach. You can look at the diff of (doc)tests to see cases where it gives more useful results. For reference, I used this kind of code to check the equivalence with R:

```julia
using Test, RCall, CategoricalArrays

for _ in 1:100, x in (rand(1:100, 10), rand(1:1000, 100)), n in 1:length(x)
    local x2
    try
        #x3 = rcopy(R"as.integer(Hmisc::cut2($x, g=$n))")
        x2 = rcopy(R"as.integer(cut($x, quantile($x, (0:$n)/$n, type=1), include.lowest=T))")
    catch
        # Skip inputs for which the R call fails (e.g. non-unique breaks)
        continue
    end
    x1 = levelcode.(cut(x, n))
    #@test x2 == x3
    @test x1 == x2
end
```
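The rule discussed in this thread (for left-closed intervals, take the next different value above the quantile point) can be sketched as below. This is only an illustration under stated assumptions, not the code in this PR: `next_value_breaks` is a hypothetical helper, it assumes a type-1 quantile (an order statistic) as the starting point, and it falls back to the maximum when no larger value exists.

```julia
# Hypothetical helper, not part of CategoricalArrays.
# For each interior quantile point, take the smallest data value strictly
# greater than the type-1 quantile, so that left-closed groups are not empty
# when the quantile falls on duplicated values.
function next_value_breaks(x::AbstractVector{<:Real}, ngroups::Integer)
    xs = sort(x)
    len = length(xs)
    breaks = [first(xs)]
    for k in 1:ngroups-1
        # Type-1 quantile: order statistic at index ceil(len * k / ngroups)
        q = xs[cld(len * k, ngroups)]
        # First sorted position holding a value strictly above q
        i = searchsortedlast(xs, q) + 1
        # If nothing lies above q, fall back to the maximum
        push!(breaks, i <= len ? xs[i] : last(xs))
    end
    push!(breaks, last(xs))
    return breaks
end

next_value_breaks([1, 1, 1, 1, 2], 2)  # → [1, 2, 2]
next_value_breaks([0, 1, 1, 1, 1], 2)  # → [0, 1, 1]
```

For these two inputs the breaks match the labels shown earlier in the conversation (`[1, 2)`, `[2, 2]` and `[0, 1)`, `[1, 1]`); the sketch makes no claim about other inputs.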
Thank you for working on this. I understand that for the 1.0 release we consider this an acceptable change.
I've pushed a commit implementing yet another approach which I find better in the end. It returns exactly the same group assignments as current master, the only difference is that breaks are taken from actual values in the data, which are generally nicer and in a more appropriate type (e.g. no float for integer inputs). I see several advantages to this:
The drawback is that some examples I showed above do not give optimal results (…).

The code I used to check equivalence with R and Polars (though the latter gives a lot of problematic results):

```julia
using Test, RCall, CategoricalArrays

for _ in 1:100, x in (rand(1:10, 10), rand(1:100, 100)), n in 1:length(x)
    @show x, n
    local x2, x4
    try
        x2 = rcopy(R"cut($x, quantile($x, (0:$n)/$n, type=7), right=F, include.lowest=T)")
        x4 = rcopy(R"polars::pl$Series(\"x\", $x)$qcut($n, left_closed=T) |> as.vector()")
        droplevels!(x2)
        # Treat inputs that produce an empty group as failures and skip them
        length(levels(x2)) < n && throw(ErrorException("empty group!"))
    catch
        continue
    end
    x1 = cut(x, n)
    @test levelcode.(x1) == levelcode.(x2)
    #@test levelcode.(x1) == levelcode.(x4)
end
```
`Statistics.quantile` returns values which are not the most appropriate to generate labels. It is more intuitive to choose values from the actual data, which are likely to have fewer decimals and make more sense for users. Unfortunately, since we use intervals closed on the left, we cannot use any of the seven standard definitions of quantiles. Type 1 is the closest, but we have to take the value next to it as a cutpoint to prevent it from being included in the next quantile group. This gives group assignments essentially consistent with R's `Hmisc::cut2` or `cut(x, quantile(x, (0:n)/n, type=1), include.lowest=T)`, though with different cutpoints in labels.
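To illustrate the point about labels, the snippet below contrasts the default `Statistics.quantile` with a type-1 quantile on integer data. `type1_quantile` is a hypothetical one-liner written for this example (an order statistic at index `ceil(n * p)`), not a function from Statistics.jl or CategoricalArrays.

```julia
using Statistics

x = [1, 2, 3, 4]

# The default `quantile` interpolates between order statistics, so integer
# input yields a Float64 that does not appear in the data.
quantile(x, 0.5)        # → 2.5

# Hypothetical helper: a type-1 quantile returns an actual data value,
# keeping the element type of the input.
type1_quantile(v, p) = sort(v)[max(1, ceil(Int, length(v) * p))]
type1_quantile(x, 0.5)  # → 2
```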
e5d84c7 to a33a853
@andreasnoack @bkamins So in the end it seems I can't reuse any of JuliaStats/Statistics.jl#187. I'll finish that PR nevertheless. I'll also make other PRs against CategoricalArrays to simplify labels.
There are still a few weird cases where R gives slightly different results. This seems to happen when the quantile falls on a series of duplicated values... Hmisc's code is quite hard to follow so I'm not sure what's happening. An example of a failure is this: we put 681 in quintile 4 instead of 3.
EDIT: maybe that's fine and there's no clearly superior solution when the quantile falls in the middle of a series of duplicate values?