Skip to content

Complete re-write of rectangular binnings to use ranges #246

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 29 commits into from
Jan 14, 2023
Merged

Conversation

Datseris
Copy link
Member

@Datseris Datseris commented Jan 14, 2023

Our binning code was really bad when it came down to real world usage. When preparing the workshop, showing the ouputs of value histogram was alwas unintuitive. This thing we do with n_eps and nextfloat always leads to completely random and unintuitive numbers for the histogram edges. it also is very hard to get "the expected histogram" for different distribtions, hence computing the KL divergence. Furthermore, it is fundamentally inaccurate.

A much better approach is to give up trying to "hack up" some accuracy ourselves, and instead take advantage of Julia's base range system that operates using TwicePrecision to always keep range step sizes what the user expects without dealing with floating point precision. So that range 0:0.1:1 has exactly 0.1 step and exactly length of 11.

I have fully re-written the internals of rectangular binnings to utilize ranges. This has lead to many, many benefits:

  1. RectangularBinning is an intermediate struct that gets cast into a FixedRectangularBinning. This reduces a lot the code.
  2. All the hacky stuff we do with n_eps have been completely removed. They were never accurate to begin with; they just changed the histogram sizes but they were just as inaccurate. To be accurate you need double precision.
  3. FixedRectangularBinning now takes in standard julia ranges as input. One range for each dimension, with convenience constructors. This allows us to utilize Julia's internal double precision system without any hacky stuff. This also means taht the outcome space has nice, simple edges and bin widths, which is what a user would like..
  4. Binnings have a precise option: if true, they use searchsortedlast, which uses internally the double precision, to map data to correct bin according to ranges. If false they use our standard division with the bin width.

To give an example of how much of achange this does, here we go:

    x = Dataset(rand(Random.MersenneTwister(1234), 100_000, 2))
    push!(x, SVector(0, 0)) # ensure both 0 and 1 have values in, exactly.
    push!(x, SVector(1, 1))

    bin = FixedRectangularBinning(0:0.1:1, 2)
    est = ValueHistogram(bin)
    out = outcome_space(est, x)
# equivalent with 
outcome_space(bin)
julia> out
10×10 Matrix{SVector{2, Float64}}:
 [0.0, 0.0]  [0.0, 0.1]  …  [0.0, 0.9]
 [0.1, 0.0]  [0.1, 0.1]     [0.1, 0.9]
 [0.2, 0.0]  [0.2, 0.1]     [0.2, 0.9]
 [0.3, 0.0]  [0.3, 0.1]     [0.3, 0.9]
 [0.4, 0.0]  [0.4, 0.1]     [0.4, 0.9]
 [0.5, 0.0]  [0.5, 0.1]  …  [0.5, 0.9]
 [0.6, 0.0]  [0.6, 0.1]     [0.6, 0.9]
 [0.7, 0.0]  [0.7, 0.1]     [0.7, 0.9]
 [0.8, 0.0]  [0.8, 0.1]     [0.8, 0.9]
 [0.9, 0.0]  [0.9, 0.1]     [0.9, 0.9]

@kahaaga
Copy link
Member

kahaaga commented Jan 14, 2023

@Datseris This seems reasonable, but I think it breaks upstream code at the moment.

To get the joint histograms for multi-argument functions I simply do (with the old code)

function encode_as_tuple(e::RectangularBinEncoding, point)
    (; mini, edgelengths) = e
    # Map a data point to its bin edge (plus one because indexing starts from 1)
    bin = floor.(Int, (point .- mini) ./ edgelengths) .+ 1
    return bin # returns
end

For a D-dimensional point, this returns a D-dimensional tuple of integers (one for each dimension, indicating which bin along that dimension the coordinate falls in). How would I do that with your proposal?

@Datseris
Copy link
Member Author

from source code of encode

    if e.precise
        # Don't know how to make this faster unfurtunately...
        cartidx = CartesianIndex(map(searchsortedlast, ranges, Tuple(point)))
    else
        bin = floor.(Int, (point .- e.mini) ./ e.widths) .+ 1
        cartidx = CartesianIndex(Tuple(bin))
    end

@Datseris
Copy link
Member Author

I'll extract this into a function cartesian_bin_index that is called by encode so that you can use that function downstream, ok?

@kahaaga
Copy link
Member

kahaaga commented Jan 14, 2023

I'll extract this into a function cartesian_bin_index that is called by encode so that you can use that function downstream, ok?

Excellent.

@Datseris
Copy link
Member Author

Fixing the tests of the Transfer Operator is very hard. I am getting

 ArgumentError: Cannot decode integer -1: out of bounds of underlying binning.

@Datseris
Copy link
Member Author

Datseris commented Jan 14, 2023

There is just so much in this source code that isn't used, makes it so hard to read the source code. In this block

            # Count how many points jump from the i-th bin to each of
            # the unique target bins, and use that to calculate the transition
            # probability from bᵢ to bⱼ.
            for (j, bᵤ) in enumerate(unique(target_bins))
                n_transitions_i_to_j = sum(target_bins .== bᵤ)

                push!(I, i)
                push!(J, bᵤ)
                push!(P, n_transitions_i_to_j / n_visitsᵢ)
            end

j is not used anywhere. Interestingly, some variables have j in their name, and the usage of capital J also confuses.

@kahaaga
Copy link
Member

kahaaga commented Jan 14, 2023

Fixing the tests of the Transfer Operator is very hard.

I'll have a look. Tag me when you're done changing things, so we don't do overlapping work

@kahaaga
Copy link
Member

kahaaga commented Jan 14, 2023

There is just so much in this source code that isn't used, makes it so hard to read the source code. In this block

Yes, I know. This code is ancient and is a direct rewrite of some messy matlab code from back in the days. As we talked about before, it will be fixed as part of #55.

But the issue shouldn't be in the loop. If the bins are computed correctly and has the expected format before the loops, then the transfer operator approximation should be correct.

@Datseris
Copy link
Member Author

I found the issue. Something is fishy is going on with the encodings

@testset "All points covered" begin
    # Ensure that given a `RectangularBinning` no point is in invalid bin

    x = Dataset(rand(100, 2))

    binnings = [
        RectangularBinning(3),
        RectangularBinning(0.2),
        RectangularBinning([2, 3]),
        RectangularBinning([0.2, 0.3]),
    ]

    for bin in binnings
        rbe = RectangularBinEncoding(bin, x)
        visited_bins = map(pᵢ -> encode(rbe, pᵢ), x)
        @test -1  visited_bins
    end
end

This errors. I'll fix this now. Or at least I'll try.

@Datseris
Copy link
Member Author

Well, to be precise, this is also a problem in the Tranfer Operator code. If you allow for FixedRectangularBinning to be given, you must be able to deal with points given the encoding -1, because that's something the fixed binnings support.

@kahaaga
Copy link
Member

kahaaga commented Jan 14, 2023

Well, to be precise, this is also a problem in the Tranfer Operator code. If you allow for FixedRectangularBinning to be given, you must be able to deal with points given the encoding -1, because that's something the fixed binnings support.

The transfer operator is approximated by how a locally linear map transforms points. An implicit assumption here is that the points are supported on the grid on which the approximation is made. It should be fine to just drop any point where one or more components are encoded as -1. We can just add a filter to the line visited_bins = map(pᵢ -> encode(encoder, pᵢ), pts), were any pᵢ that has a -1 as a component is simply dropped.

I've always made sure that the binning used covers all the point a priori, so this hadn't crossed my mind before. My mistake.

@Datseris
Copy link
Member Author

this should be done in a different pr.

for now I found the obvious problem. When makign the range range(min, max; step = x) it is not guaranteed that max is within the range, something that RectangularBinning promises. I fix it now.

@Datseris Datseris merged commit e2cf0c8 into main Jan 14, 2023
@Datseris Datseris deleted the valuefixed branch January 14, 2023 15:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants