vishpillai123 commented Dec 16, 2025

GLM ("Generalized Linear Model") architectures in H2O differ in a number of ways from our other architectures, DRF/XRT, GBM, and XGBoost (all of which are tree-based models). In particular, GLMs struggle with unseen categories in enum variables such as "term_program_of_study".

The best way to manage this is to impute unseen categories in categorical variables before running SHAP; otherwise we run into hard errors in H2O. I built a helper for the GLM case that imputes with the mode. We also log all of this before running SHAP.
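
A minimal sketch of the idea, for illustration only. It assumes pandas DataFrames with plain string columns rather than H2OFrames, and the function name, sample values, and log message are all hypothetical; this is not the helper added in this PR.

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def impute_unseen_levels_with_mode(train, score, categorical_cols):
    """Replace levels in `score` that never appeared in `train` with the training mode.

    Hypothetical helper for illustration; assumes object/string columns.
    Returns an imputed copy of `score` and a per-column count of replaced values.
    """
    imputed = score.copy()
    replaced = {}
    for col in categorical_cols:
        seen = set(train[col].dropna().unique())
        # Most frequent training level (ties resolve to the first in sorted order).
        mode = train[col].mode(dropna=True).iloc[0]
        # Unseen = present in the scoring data but never observed during training.
        unseen_mask = imputed[col].notna() & ~imputed[col].isin(seen)
        n_unseen = int(unseen_mask.sum())
        if n_unseen:
            # Log what is being replaced before any SHAP computation runs.
            logger.warning(
                "Column %r: imputing %d unseen value(s) %s with mode %r",
                col, n_unseen, sorted(imputed.loc[unseen_mask, col].unique()), mode,
            )
            imputed.loc[unseen_mask, col] = mode
        replaced[col] = n_unseen
    return imputed, replaced


# Made-up example data: "Welding" never appears in the training frame.
train = pd.DataFrame({"term_program_of_study": ["Nursing", "Nursing", "Biology"]})
score = pd.DataFrame({"term_program_of_study": ["Biology", "Welding", "Nursing"]})
imputed, replaced = impute_unseen_levels_with_mode(
    train, score, ["term_program_of_study"]
)
```

Here the unseen level is replaced by the most frequent training level, and the original scoring frame is left untouched so the logged counts can be checked against it.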

We really should move away from high-cardinality categoricals, as they create the most risk. I will work on this separately.

vishpillai123 changed the title from "feat: h2o glm unseen categories" to "feat: handling unseen categories in glm h2o architectures" Dec 16, 2025
vishpillai123 marked this pull request as ready for review December 16, 2025 21:19