Uninformative Error Message when Sample Size is 1 #67

@nsgrantham

Description

I am encountering an issue where passing a Turing model of type DynamicPPL.Model to psis_loo() returns values of -Inf or NaN.

However, when I pass it a log-likelihood function and the data as an array of tuples, it behaves "correctly" (in the sense that it returns finite values; I have not independently verified that they are accurate).

The following is a reproducible example.

using Distributions
using DataFrames
using Turing
using ParetoSmooth

function generate_persons(; n_years=1000, max_age=65, n_births=20, age_of_marriage=18)
    # https://github.com/rmcelreath/rethinking/blob/master/R/sim_happiness.R

    invlogit(x) = 1 / (exp(-x) + 1)
    age = fill(1, n_births)
    married = fill(false, n_births)
    happiness = collect(range(-2, 2, length=n_births))

    for _ in 1:n_years
        age .+= 1
        append!(age, fill(1, n_births))
        append!(married, fill(false, n_births))
        append!(happiness, collect(range(-2, 2, length=n_births)))

        for i in eachindex(age, married, happiness)
            if age[i] >= age_of_marriage && !married[i]
                married[i] = rand(Bernoulli(invlogit(happiness[i] - 4)))
            end
        end

        deaths = findall(age .> max_age)
        deleteat!(age, deaths)
        deleteat!(married, deaths)
        deleteat!(happiness, deaths)
    end

    DataFrame(
        age = age,
        married = convert.(Int, married),
        happiness = happiness
    )
end

persons = generate_persons()
adults = persons[persons.age .> 17, :]
min_age, max_age = extrema(adults.age)
adults.age = (adults.age .- min_age) ./ (max_age - min_age)

@model function adult_happiness(h, a)
    α ~ Normal(0, 1)
    β ~ Normal(0, 2)
    σ ~ Exponential(1)
    μ = α .+ β .* a
    h ~ MvNormal(μ, σ)
end

model = adult_happiness(adults.happiness, adults.age)
post = sample(model, NUTS(), 10_000)

Now, reading through the code in src/TuringHelpers.jl in this repo, it seems as though there's a method that supports Turing models directly. However, if I run

psis_loo(model, post)

then I get

┌ Warning: Some Pareto k values are extremely high (>1). PSIS will not produce consistent estimates.
└ @ ParetoSmooth ~/.julia/packages/ParetoSmooth/agZDP/src/InternalHelpers.jl:43
Results of PSIS-LOO-CV with 10000 Monte Carlo samples and 1 data points. Total Monte Carlo SE of NaN.
┌───────────┬───────┬──────────┬───────┬─────────┐
│           │ total │ se_total │  mean │ se_mean │
├───────────┼───────┼──────────┼───────┼─────────┤
│   cv_elpd │  -Inf │      NaN │  -Inf │     NaN │
│ naive_lpd │  -Inf │      NaN │  -Inf │     NaN │
│     p_eff │   NaN │      NaN │   NaN │     NaN │
└───────────┴───────┴──────────┴───────┴─────────┘
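One way to check how the data is being partitioned (assuming the Turing helper works from pointwise log-likelihoods, which is only my guess at the mechanism) is to call `pointwise_loglikelihoods` directly. Because the model observes `h` through a single `MvNormal`, I would expect the whole vector to be recorded as one observed variable:

```julia
# Assumes `model` and `post` from the example above are in scope.
# With `h ~ MvNormal(μ, σ)`, the entire vector is one observed variable,
# so the pointwise log-likelihoods come back under a single key ("h")
# rather than one key per row of the data.
lls = Turing.pointwise_loglikelihoods(model, post)
@show keys(lls)  # presumably a single key, matching "1 data points"
```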

Instead, I can define the appropriate log-likelihood function and provide the data in an array of tuples:

function ll_fun(α, β, σ, data)
    h, a = data
    logpdf(Normal(α + β * a, σ), h)
end

psis_loo(ll_fun, post, collect(zip(adults.happiness, adults.age)))

And this yields something that seems correct.

[ Info: No source provided for samples; variables are assumed to be from a Markov Chain. If the samples are independent, specify this with keyword argument `source=:other`.
Results of PSIS-LOO-CV with 10000 Monte Carlo samples and 960 data points. Total Monte Carlo SE of 0.019.
┌───────────┬──────────┬──────────┬───────┬─────────┐
│           │    total │ se_total │  mean │ se_mean │
├───────────┼──────────┼──────────┼───────┼─────────┤
│   cv_elpd │ -1551.04 │    13.81 │ -1.62 │    0.01 │
│ naive_lpd │ -1548.62 │    13.75 │ -1.61 │    0.01 │
│     p_eff │     2.42 │     0.08 │  0.00 │    0.00 │
└───────────┴──────────┴──────────┴───────┴─────────┘

I would expect these two approaches to yield the same result.

Am I simply using psis_loo() incorrectly when passing it a Turing model of type DynamicPPL.Model? The message "Results of PSIS-LOO-CV with 10000 Monte Carlo samples and 1 data points" suggests that it does not recognize that the model contains multiple data points.
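If that reading is right, one possible workaround (a sketch only, untested against this version of ParetoSmooth) is to state the likelihood one observation at a time, so that Turing records a separate pointwise log-likelihood per row instead of one for the whole vector:

```julia
# Each h[i] is its own observed variable here, so Turing should expose
# one pointwise log-likelihood per data point (960 in this example)
# rather than a single one for the MvNormal-observed vector.
@model function adult_happiness_pointwise(h, a)
    α ~ Normal(0, 1)
    β ~ Normal(0, 2)
    σ ~ Exponential(1)
    for i in eachindex(h)
        h[i] ~ Normal(α + β * a[i], σ)
    end
end

model_pw = adult_happiness_pointwise(adults.happiness, adults.age)
post_pw = sample(model_pw, NUTS(), 10_000)
psis_loo(model_pw, post_pw)
```

The posterior should be equivalent to the MvNormal formulation (an i.i.d. normal likelihood either way), but the observation structure visible to ParetoSmooth differs.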

For now I will continue to use the ll_fun approach, but I wanted to flag this here in case something in TuringHelpers.jl needs a closer look.
