
Conversation


@vincentarelbundock vincentarelbundock commented Jul 8, 2024

Install:

install.packages("topmodels", repos = "https://R-Forge.R-project.org")
remotes::install_github("vincentarelbundock/marginaleffects@topmodels")

Examples seem to work:

library("topmodels")
library("marginaleffects")

data("FIFA2018", package = "distributions3")
m <- glm(goals ~ difference, data = FIFA2018, family = poisson)
p <- promodel(m)

# Informative error
avg_slopes(p, type = "junk")
> Error in sanitize_type(model = model, type = type, calling_function = "slopes"): Assertion on 'type' failed: Must be element of set {'mean','variance','quantile','probability','density','loglikelihood','kurtosis','skewness'}, but is 'junk'.
# Mean
avg_slopes(p)
> 
>        Term    Contrast Estimate Std. Error    z Pr(>|z|)    S 2.5 % 97.5 %
>  difference mean(dY/dX)    0.536      0.143 3.74   <0.001 12.4 0.255  0.817
> 
> Columns: term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted 
> Type:  mean
# Probability
avg_slopes(p, type = "probability")
> 
>        Term    Contrast Estimate Std. Error     z Pr(>|z|)    S  2.5 %  97.5 %
>  difference mean(dY/dX)   -0.141     0.0333 -4.23   <0.001 15.4 -0.206 -0.0754
> 
> Columns: term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted 
> Type:  probability
# Probability at=3
avg_slopes(p, type = "probability", at = 3)
> 
>        Term    Contrast Estimate Std. Error     z Pr(>|z|)   S  2.5 %  97.5 %
>  difference mean(dY/dX)  -0.0605     0.0218 -2.78  0.00545 7.5 -0.103 -0.0178
> 
> Columns: term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted 
> Type:  probability
# Effect of a change of 2 units of `difference` 
# on the probability that teams score 3 goals or less
avg_comparisons(p,
    variables = list(difference = 2),
    type = "probability",
    at = 3)
> 
>        Term Contrast Estimate Std. Error     z Pr(>|z|)   S 2.5 %  97.5 %
>  difference mean(+2)   -0.291      0.122 -2.39   0.0168 5.9 -0.53 -0.0526
> 
> Columns: term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted 
> Type:  probability
# Predictions also work
predictions(p)
> 
>  Estimate Std. Error     z Pr(>|z|)     S 2.5 % 97.5 %
>     1.768      0.182  9.73   <0.001  71.9 1.412   2.12
>     0.866      0.120  7.19   <0.001  40.5 0.630   1.10
>     1.030      0.108  9.51   <0.001  68.8 0.817   1.24
>     1.486      0.120 12.36   <0.001 114.1 1.250   1.72
>     1.435      0.113 12.68   <0.001 120.0 1.214   1.66
> --- 118 rows omitted. See ?avg_predictions and ?print.marginaleffects --- 
>     1.324      0.103 12.84   <0.001 122.9 1.122   1.53
>     1.332      0.104 12.86   <0.001 123.2 1.129   1.53
>     1.149      0.102 11.27   <0.001  95.5 0.949   1.35
>     1.604      0.142 11.30   <0.001  96.0 1.326   1.88
>     0.954      0.114  8.38   <0.001  54.1 0.731   1.18
> Columns: rowid, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, goals, difference 
> Type:  mean

@vincentarelbundock

@zeileis I have not tried too many things yet, but at first glance what we discussed seems to work well. See above for installation and examples.


zeileis commented Jul 8, 2024

Vincent, really cool, thanks!

I have not tried many things yet, but I noticed the following. While avg_slopes() gives me the same output for the original Poisson glm object m and the promodel object p, I get somewhat different uncertainty results for predictions().

Also, the type argument is a bit more flexible in procast() than what I have documented so far in the main help file. We allow: "quantile", "mean", "variance", "probability", "cdf", "density", "pdf", "pmf", "loglikelihood", "log_pdf", "distribution", "parameters", "kurtosis", "skewness".

Thus, instead of "probability" (for compatibility with the p functions) you can also use "cdf" (for compatibility with distributions3). Similarly for "density" vs. "pdf" and "pmf", and for "loglikelihood" vs. "log_pdf". I haven't decided yet which to push as the "preferred" set of type labels. (Thoughts and opinions are welcome...)
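If these labels really are one-to-one aliases, a quick sanity check on the FIFA2018 model above might look like this (a sketch; it assumes procast() dispatches identically for both labels):

```r
library("topmodels")

data("FIFA2018", package = "distributions3")
m <- glm(goals ~ difference, data = FIFA2018, family = poisson)

# If "cdf" is a pure alias for "probability", these should be identical
p1 <- procast(m, type = "probability", at = 3)
p2 <- procast(m, type = "cdf", at = 3)
all.equal(p1, p2)
```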

More tomorrow!

@vincentarelbundock

TODO:

  • Agree on a final list of type values to allow
  • Do SEs match in predictions when looking at topmodels vs raw model objects?


vincentarelbundock commented Jul 9, 2024

> I haven't decided yet which to push as the "preferred" set of type labels. (Thoughts and opinions are welcome...)

In general, I like verbosity, but in this particular case I feel that it might be best to be consistent with distributions3. Sure, there is a parallel between the first letters of "pnorm" and "probability", but the parallel is not that obvious. So it's better to be consistent with something than to invent new types altogether. But that's a weak preference. What matters most is that you stick with one convention and avoid duplication.
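For reference, these are the distributions3 generics that the "cdf" / "pdf" / "log_pdf" labels would line up with (a sketch using an arbitrary Poisson distribution, not the FIFA model):

```r
library("distributions3")

X <- Poisson(lambda = 1.5)
cdf(X, 3)          # P(X <= 3)
pdf(X, 3)          # P(X = 3); the pmf, for discrete distributions
log_pdf(X, 3)      # log of the above
quantile(X, 0.9)   # smallest k with P(X <= k) >= 0.9
```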

A lot of users are asking me to implement prediction intervals in marginaleffects. So far, that has only been possible for Bayesian models or via conformal prediction with the inferences() function.

Is this something that topmodels can help me do reliably and flexibly for many model types? I'm thinking of something like this, which appears to yield adequate 90% coverage in an ultra simple normal model:

library("topmodels")
library("marginaleffects")

sim <- function(...) {
    N <- 500
    # Simulate a simple linear-normal data set and split into train/test halves
    dat <- data.frame(x = rnorm(N))
    dat$y <- dat$x + rnorm(N)
    idx <- sample(1:N, N / 2)
    train <- dat[idx, ]
    test <- dat[-idx, ]
    m <- glm(y ~ x, data = train)
    # Build a 90% prediction interval from the 5th and 95th predictive quantiles
    p <- do.call(cbind, list(
        procast(m, newdata = test, type = "mean"),
        procast(m, newdata = test, type = "quantile", at = .05),
        procast(m, newdata = test, type = "quantile", at = .95)
    ))
    p <- setNames(p, c("estimate", "conf.low", "conf.high"))
    p$truth <- test$y
    # Share of held-out observations that fall inside the interval
    coverage <- mean(p$truth < p$conf.high & p$truth > p$conf.low)
    return(coverage)
}

mean(sapply(seq_len(100), sim))
> [1] 0.89572
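If topmodels is the right tool for this, the quantile-based interval in the simulation could presumably be factored into a small helper. The function below is purely hypothetical (not part of topmodels or marginaleffects); it just generalizes the 90% interval above to an arbitrary coverage level:

```r
library("topmodels")

# Hypothetical helper: prediction interval from procast() predictive quantiles
pred_interval <- function(model, newdata, level = 0.90) {
    alpha <- (1 - level) / 2
    data.frame(
        estimate  = procast(model, newdata = newdata, type = "mean")[[1]],
        conf.low  = procast(model, newdata = newdata, type = "quantile", at = alpha)[[1]],
        conf.high = procast(model, newdata = newdata, type = "quantile", at = 1 - alpha)[[1]]
    )
}

# Usage sketch on a toy model
m <- glm(y ~ x, data = data.frame(x = rnorm(100), y = rnorm(100)))
pred_interval(m, newdata = data.frame(x = c(-1, 0, 1)), level = 0.95)
```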
