Uninformative Error Message when Sample Size is 1 #67

@nsgrantham

Description

I am encountering an issue where passing a Turing model of type DynamicPPL.Model to psis_loo() returns values of -Inf or NaN.

However, when I pass it a log-likelihood function and the data as an array of tuples, it behaves "correctly" (in the sense that it returns finite values; I have not independently verified that they are accurate).

The following is a reproducible example.

using Distributions
using DataFrames
using Turing
using ParetoSmooth

function generate_persons(; n_years=1000, max_age=65, n_births=20, age_of_marriage=18)
    # https://github.com/rmcelreath/rethinking/blob/master/R/sim_happiness.R

    invlogit(x) = 1 / (exp(-x) + 1)
    age = fill(1, n_births)
    married = fill(false, n_births)
    happiness = collect(range(-2, 2, length=n_births))

    for _ in 1:n_years
        age .+= 1
        append!(age, fill(1, n_births))
        append!(married, fill(false, n_births))
        append!(happiness, collect(range(-2, 2, length=n_births)))

        for i in eachindex(age, married, happiness)
            if age[i] >= age_of_marriage && !married[i]
                married[i] = rand(Bernoulli(invlogit(happiness[i] - 4)))
            end
        end

        deaths = findall(age .> max_age)
        deleteat!(age, deaths)
        deleteat!(married, deaths)
        deleteat!(happiness, deaths)
    end

    DataFrame(
        age = age,
        married = convert.(Int, married),
        happiness = happiness
    )
end

persons = generate_persons()
adults = persons[persons.age .> 17, :]
min_age, max_age = extrema(adults.age)
adults.age = (adults.age .- min_age) ./ (max_age - min_age)

@model function adult_happiness(h, a)
    α ~ Normal(0, 1)
    β ~ Normal(0, 2)
    σ ~ Exponential(1)
    μ = α .+ β .* a
    h ~ MvNormal(μ, σ)
end

model = adult_happiness(adults.happiness, adults.age)
post = sample(model, NUTS(), 10_000)

Now, reading through the code in src/TuringHelpers.jl in this repo, it seems as though there's a method that supports Turing models directly. However, if I run

psis_loo(model, post)

then I get

┌ Warning: Some Pareto k values are extremely high (>1). PSIS will not produce consistent estimates.
└ @ ParetoSmooth ~/.julia/packages/ParetoSmooth/agZDP/src/InternalHelpers.jl:43
Results of PSIS-LOO-CV with 10000 Monte Carlo samples and 1 data points. Total Monte Carlo SE of NaN.
┌───────────┬───────┬──────────┬───────┬─────────┐
│           │ total │ se_total │  mean │ se_mean │
├───────────┼───────┼──────────┼───────┼─────────┤
│   cv_elpd │  -Inf │      NaN │  -Inf │     NaN │
│ naive_lpd │  -Inf │      NaN │  -Inf │     NaN │
│     p_eff │   NaN │      NaN │   NaN │     NaN │
└───────────┴───────┴──────────┴───────┴─────────┘
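One way to check how the data is being partitioned (assuming the Turing helper works from pointwise log-likelihoods, which is only my guess at the mechanism) is to call `pointwise_loglikelihoods` directly. Because the model observes `h` through a single `MvNormal`, I would expect the whole vector to be recorded as one observed variable:

```julia
# Assumes `model` and `post` from the example above are in scope.
# With `h ~ MvNormal(μ, σ)`, the entire vector is one observed variable,
# so the pointwise log-likelihoods come back under a single key ("h")
# rather than one key per row of the data.
lls = Turing.pointwise_loglikelihoods(model, post)
@show keys(lls)  # presumably a single key, matching "1 data points"
```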

Instead, I can define the appropriate log-likelihood function and provide the data in an array of tuples:

function ll_fun(α, β, σ, data)
    h, a = data
    logpdf(Normal(α + β * a, σ), h)
end

psis_loo(ll_fun, post, collect(zip(adults.happiness, adults.age)))

And this yields something that seems correct.

[ Info: No source provided for samples; variables are assumed to be from a Markov Chain. If the samples are independent, specify this with keyword argument `source=:other`.
Results of PSIS-LOO-CV with 10000 Monte Carlo samples and 960 data points. Total Monte Carlo SE of 0.019.
┌───────────┬──────────┬──────────┬───────┬─────────┐
│           │    total │ se_total │  mean │ se_mean │
├───────────┼──────────┼──────────┼───────┼─────────┤
│   cv_elpd │ -1551.04 │    13.81 │ -1.62 │    0.01 │
│ naive_lpd │ -1548.62 │    13.75 │ -1.61 │    0.01 │
│     p_eff │     2.42 │     0.08 │  0.00 │    0.00 │
└───────────┴──────────┴──────────┴───────┴─────────┘

I would expect these two approaches to yield the same result.

Am I simply using psis_loo() incorrectly when passing it a Turing model of type DynamicPPL.Model? The message "Results of PSIS-LOO-CV with 10000 Monte Carlo samples and 1 data points" suggests that it does not recognize that the model contains multiple data points.
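If that reading is right, one possible workaround (a sketch only, untested against this version of ParetoSmooth) is to state the likelihood one observation at a time, so that Turing records a separate pointwise log-likelihood per row instead of one for the whole vector:

```julia
# Each h[i] is its own observed variable here, so Turing should expose
# one pointwise log-likelihood per data point (960 in this example)
# rather than a single one for the MvNormal-observed vector.
@model function adult_happiness_pointwise(h, a)
    α ~ Normal(0, 1)
    β ~ Normal(0, 2)
    σ ~ Exponential(1)
    for i in eachindex(h)
        h[i] ~ Normal(α + β * a[i], σ)
    end
end

model_pw = adult_happiness_pointwise(adults.happiness, adults.age)
post_pw = sample(model_pw, NUTS(), 10_000)
psis_loo(model_pw, post_pw)
```

The posterior should be equivalent to the MvNormal formulation (an i.i.d. normal likelihood either way), but the observation structure visible to ParetoSmooth differs.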

For now I will continue to use the ll_fun approach, but I wanted to flag this here in case something in TuringHelpers.jl needs a closer look.
