Complete re-write of rectangular binnings to use ranges #246

Datseris · 2023-01-14T12:56:17Z

Our binning code was really bad when it came to real-world usage. When preparing the workshop, showing the outputs of the value histogram was always unintuitive. The thing we do with n_eps and nextfloat always leads to seemingly random and unintuitive numbers for the histogram edges. It is also very hard to get "the expected histogram" for different distributions, and hence to compute the KL divergence. Furthermore, it is fundamentally inaccurate.

A much better approach is to give up trying to "hack up" some accuracy ourselves, and instead take advantage of Julia's Base range system, which uses TwicePrecision to always keep range step sizes at what the user expects, without us having to deal with floating point precision. So the range 0:0.1:1 has a step of exactly 0.1 and a length of exactly 11.
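As a small illustration of the range behaviour described above, the following sketch uses only Base Julia. It is not taken from the PR; it just checks the properties claimed for 0:0.1:1 and contrasts one range element with naive floating point multiplication.

```julia
# A floating point range literal in Base Julia (a StepRangeLen).
r = 0:0.1:1

println(length(r))        # number of elements in the range
println(step(r))          # the step, as the user wrote it
println(r[4] == 0.3)      # the 4th element compared with the literal 0.3
println(3 * 0.1 == 0.3)   # the same value computed by plain multiplication
```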

I have fully re-written the internals of rectangular binnings to utilize ranges. This has led to many, many benefits:

RectangularBinning is now an intermediate struct that gets converted into a FixedRectangularBinning. This reduces the amount of code considerably.
All the hacky stuff we did with n_eps has been completely removed. It was never accurate to begin with; it just changed the histogram sizes while being just as inaccurate. To be accurate you need double precision.
FixedRectangularBinning now takes standard Julia ranges as input, one range per dimension, with convenience constructors. This allows us to use Julia's internal double precision system without any hacky stuff. It also means that the outcome space has nice, simple edges and bin widths, which is what a user would like.
Binnings have a precise option: if true, they use searchsortedlast, which internally uses the double precision, to map data to the correct bin according to the ranges. If false, they use our standard division by the bin width.
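To make the difference between the two lookup modes concrete, here is a minimal one-dimensional sketch in Base Julia. The variable names are illustrative and not from the package; the point x = 0.3 is chosen because it sits exactly on a bin edge, where plain division is most fragile.

```julia
r = 0:0.1:1      # bin edges along one dimension
x = 0.3          # a value that sits exactly on a bin edge

# "Precise" style: look the value up in the range itself.
i_precise = searchsortedlast(r, x)

# "Fast" style: plain floating point division by the bin width.
i_fast = floor(Int, (x - first(r)) / step(r)) + 1

println((i_precise, i_fast))
```

The two indices can disagree for values on an edge, because the division (x - first(r)) / step(r) is rounded in ordinary double arithmetic while the range elements are not.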

To give an example of how much of a change this makes, here we go:

    x = Dataset(rand(Random.MersenneTwister(1234), 100_000, 2))
    push!(x, SVector(0, 0)) # ensure both 0 and 1 have values in, exactly.
    push!(x, SVector(1, 1))

    bin = FixedRectangularBinning(0:0.1:1, 2)
    est = ValueHistogram(bin)
    out = outcome_space(est, x)
    # equivalent to
    outcome_space(bin)

julia> out
10×10 Matrix{SVector{2, Float64}}:
 [0.0, 0.0]  [0.0, 0.1]  …  [0.0, 0.9]
 [0.1, 0.0]  [0.1, 0.1]     [0.1, 0.9]
 [0.2, 0.0]  [0.2, 0.1]     [0.2, 0.9]
 [0.3, 0.0]  [0.3, 0.1]     [0.3, 0.9]
 [0.4, 0.0]  [0.4, 0.1]     [0.4, 0.9]
 [0.5, 0.0]  [0.5, 0.1]  …  [0.5, 0.9]
 [0.6, 0.0]  [0.6, 0.1]     [0.6, 0.9]
 [0.7, 0.0]  [0.7, 0.1]     [0.7, 0.9]
 [0.8, 0.0]  [0.8, 0.1]     [0.8, 0.9]
 [0.9, 0.0]  [0.9, 0.1]     [0.9, 0.9]

kahaaga · 2023-01-14T14:28:55Z

@Datseris This seems reasonable, but I think it breaks upstream code at the moment.

To get the joint histograms for multi-argument functions I simply do (with the old code)

function encode_as_tuple(e::RectangularBinEncoding, point)
    (; mini, edgelengths) = e
    # Map a data point to its bin edge (plus one because indexing starts from 1)
    bin = floor.(Int, (point .- mini) ./ edgelengths) .+ 1
    return bin # returns
end

For a D-dimensional point, this returns a D-dimensional tuple of integers (one for each dimension, indicating which bin along that dimension the coordinate falls in). How would I do that with your proposal?

Datseris · 2023-01-14T14:40:39Z

From the source code of encode:

    if e.precise
        # Don't know how to make this faster unfortunately...
        cartidx = CartesianIndex(map(searchsortedlast, ranges, Tuple(point)))
    else
        bin = floor.(Int, (point .- e.mini) ./ e.widths) .+ 1
        cartidx = CartesianIndex(Tuple(bin))
    end

Datseris · 2023-01-14T14:41:24Z

I'll extract this into a function cartesian_bin_index that is called by encode so that you can use that function downstream, ok?
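A rough sketch of what such an extracted helper could look like, written against plain tuples of ranges so that it runs in Base Julia. The name bin_index_sketch, its signature and the precise keyword are assumptions made for illustration; they are not the function that was actually added in this PR.

```julia
# Hypothetical helper: name, signature and keyword are illustrative assumptions.
# `ranges` holds one range of bin edges per dimension; `point` is a D-dimensional point.
function bin_index_sketch(ranges::Tuple, point; precise::Bool = true)
    if precise
        # Look each coordinate up in the range of its own dimension.
        idxs = map(searchsortedlast, ranges, Tuple(point))
    else
        # Plain division by the bin width, plus one for 1-based indexing.
        idxs = map((r, p) -> floor(Int, (p - first(r)) / step(r)) + 1,
                   ranges, Tuple(point))
    end
    return CartesianIndex(idxs)
end

ranges = (0:0.1:1, 0:0.25:1)
idx = bin_index_sketch(ranges, (0.55, 0.8))
println(Tuple(idx))   # one integer per dimension
```

Calling Tuple on the returned CartesianIndex gives the D-dimensional tuple of per-dimension bin numbers asked about earlier in the thread.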

kahaaga · 2023-01-14T14:42:03Z

I'll extract this into a function cartesian_bin_index that is called by encode so that you can use that function downstream, ok?

Excellent.

Datseris · 2023-01-14T14:57:17Z

Fixing the tests of the Transfer Operator is very hard. I am getting

 ArgumentError: Cannot decode integer -1: out of bounds of underlying binning.

Datseris · 2023-01-14T14:58:56Z

There is just so much in this source code that isn't used, which makes it so hard to read. In this block

            # Count how many points jump from the i-th bin to each of
            # the unique target bins, and use that to calculate the transition
            # probability from bᵢ to bⱼ.
            for (j, bᵤ) in enumerate(unique(target_bins))
                n_transitions_i_to_j = sum(target_bins .== bᵤ)

                push!(I, i)
                push!(J, bᵤ)
                push!(P, n_transitions_i_to_j / n_visitsᵢ)
            end

j is not used anywhere. Interestingly, some variables have j in their name, and the usage of capital J also confuses.
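For illustration only, the same counting step can be written without the unused index and with more descriptive container names. The toy data and all names below are assumptions, not the package's code.

```julia
# Toy data: the bins reached in one step from points that start in bin `i`.
target_bins = [2, 5, 2, 7, 5, 2]
n_visits = length(target_bins)
i = 1

rows, cols, probs = Int[], Int[], Float64[]   # sparse-matrix triplets
for target in unique(target_bins)
    # Number of transitions from bin i to this target bin.
    n_transitions = count(==(target), target_bins)
    push!(rows, i)
    push!(cols, target)
    push!(probs, n_transitions / n_visits)
end

println(collect(zip(rows, cols, probs)))
```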

kahaaga · 2023-01-14T15:04:19Z

Fixing the tests of the Transfer Operator is very hard.

I'll have a look. Tag me when you're done changing things, so we don't do overlapping work

kahaaga · 2023-01-14T15:06:24Z

There is just so much in this source code that isn't used, which makes it so hard to read. In this block

Yes, I know. This code is ancient and is a direct rewrite of some messy MATLAB code from back in the day. As we talked about before, it will be fixed as part of #55.

But the issue shouldn't be in the loop. If the bins are computed correctly and have the expected format before the loops, then the transfer operator approximation should be correct.

Datseris · 2023-01-14T15:09:50Z

I found the issue. Something fishy is going on with the encodings:

@testset "All points covered" begin
    # Ensure that given a `RectangularBinning` no point is in invalid bin

    x = Dataset(rand(100, 2))

    binnings = [
        RectangularBinning(3),
        RectangularBinning(0.2),
        RectangularBinning([2, 3]),
        RectangularBinning([0.2, 0.3]),
    ]

    for bin in binnings
        rbe = RectangularBinEncoding(bin, x)
        visited_bins = map(pᵢ -> encode(rbe, pᵢ), x)
        @test -1 ∉ visited_bins
    end
end

This errors. I'll fix this now. Or at least I'll try.

Datseris · 2023-01-14T15:10:43Z

Well, to be precise, this is also a problem in the Transfer Operator code. If you allow for FixedRectangularBinning to be given, you must be able to deal with points given the encoding -1, because that's something the fixed binnings support.

kahaaga · 2023-01-14T15:17:46Z

Well, to be precise, this is also a problem in the Transfer Operator code. If you allow for FixedRectangularBinning to be given, you must be able to deal with points given the encoding -1, because that's something the fixed binnings support.

The transfer operator is approximated by how a locally linear map transforms points. An implicit assumption here is that the points are supported on the grid on which the approximation is made. It should be fine to just drop any point where one or more components are encoded as -1. We can just add a filter to the line visited_bins = map(pᵢ -> encode(encoder, pᵢ), pts), where any pᵢ that has a -1 as a component is simply dropped.

I've always made sure that the binning used covers all the points a priori, so this hadn't crossed my mind before. My mistake.
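The filtering idea can be sketched in one dimension as follows. toy_encode is a stand-in written for this example and is not the package's encode; it only mimics the convention that a point outside the binning is encoded as -1.

```julia
# Toy stand-in for an encoder (an assumption, not the package API):
# returns the bin number, or -1 when the point lies outside the binning.
toy_encode(r, x) = first(r) <= x < last(r) ? searchsortedlast(r, x) : -1

r = 0:0.25:1
pts = [0.1, 0.6, 1.3, -0.2, 0.99]

# Encode every point, then drop the ones that fell outside the grid.
visited_bins = filter(!=(-1), map(p -> toy_encode(r, p), pts))
println(visited_bins)
```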

Datseris · 2023-01-14T15:19:20Z

This should be done in a different PR.

For now I found the obvious problem. When making the range range(min, max; step = x), it is not guaranteed that max is within the range, which is something that RectangularBinning promises. I'll fix it now.
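The problem is easy to reproduce in Base Julia. The second half of the sketch shows one possible way to restore coverage by appending an extra edge; that remedy is an assumption made for illustration and not necessarily what the commits in this PR do.

```julia
mini, maxi, width = 0.0, 1.0, 0.3

# The step does not divide the interval evenly, so the range stops short of maxi.
r = range(mini, maxi; step = width)
println((length(r), last(r)))

# Hypothetical remedy: append one more edge whenever maxi is not strictly
# below the last edge, so that maxi falls inside the final bin.
covering = last(r) > maxi ? r : range(mini; step = width, length = length(r) + 1)
println((length(covering), last(covering)))
```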

Datseris added 16 commits January 14, 2023 10:37

allow bin width to be given to fixed rect bin directly

0461bc1

remove incorrect edge application of nextfloat for given edge

e0edcdf

better doc for fixed rect bin

37514e8

better docuemnt and clarify bins

9933870

compelte rewrite of binnings based on ranges

a180cd0

make compilable

1aa1547

super simple outcome space

8c988c0

move deprecations to their appropriate file

6c3f14e

more clear docs for rect bin

119ce68

move deprecations to file

9ae3667

allow geting FRB from RB

c89427b

use nextfloat to ensure promise

ac40e53

increase performance of inprecise version

c4fd1de

correct extraction of length of range

053a905

re-write tests

e3e082d

more clarity in tests

dd73c8c

Datseris added 2 commits January 14, 2023 16:37

fix diversityy

56df9de

remove transferoperatorencoder in favor of RectangularBinEncoding

bc9fa6a

Datseris added 7 commits January 14, 2023 16:42

add catesian bin index internal function

90f6541

actually call the function

9555785

fix diversity: given bins + 1 to the range

976ee77

allow dispatch to encoding

14e534a

fix incorrect definition of outcome_space in TrOp

65faafd

fix some tests of TrOp...

77bb880

remove completely unused computations from function in TrOp...

201f504

remove more completely unused codfe from TrOp...

bb04774

remove more unused code and comments

e153741

Datseris added 2 commits January 14, 2023 17:24

ensre ranges are large enough to cover data in fixed witdth

937973c

add changelog entry

68e759f

Datseris merged commit e2cf0c8 into main Jan 14, 2023

Datseris deleted the valuefixed branch January 14, 2023 15:34
