Replies: 1 comment
-
Hi, thanks for the question! I think a more scalable approach would be to generate a (fast) model for predicting the difficulty, e.g., as shown in the COPA 2024 tutorial (see the notebook in the docs folder), there using the out-of-bag predictions of a random forest; in your case you would need to set aside part of the training data for this purpose:
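(The quoted snippet did not survive the page export; the following is an illustrative sketch of the idea rather than the tutorial's exact code. It assumes the crepes `ConformalPredictiveSystem` API, with fit taking residuals and sigmas and predict taking point predictions and sigmas; `X_prop_train`, `y_prop_train`, `X_cal`, `y_cal`, and `X_test` are placeholder names for a proper-training/calibration/test split.)

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from crepes import ConformalPredictiveSystem

# Underlying model; oob_score=True makes out-of-bag predictions available,
# so the difficulty targets come for free from the training data.
learner = RandomForestRegressor(n_estimators=500, oob_score=True, n_jobs=-1)
learner.fit(X_prop_train, y_prop_train)

# Fast difficulty model trained on the absolute out-of-bag residuals.
# If your underlying model has no OOB predictions, set aside part of the
# training data and use hold-out residuals as targets instead.
difficulty_model = RandomForestRegressor(n_estimators=100, n_jobs=-1)
difficulty_model.fit(X_prop_train,
                     np.abs(y_prop_train - learner.oob_prediction_))

# Difficulty estimates (sigmas) are now single batched predict calls,
# which scale far better than nearest-neighbour searches.
sigmas_cal = difficulty_model.predict(X_cal)
sigmas_test = difficulty_model.predict(X_test)

cps = ConformalPredictiveSystem()
cps.fit(residuals=y_cal - learner.predict(X_cal), sigmas=sigmas_cal)
intervals = cps.predict(learner.predict(X_test), sigmas=sigmas_test,
                        lower_percentiles=[5], higher_percentiles=[95])
```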
Best regards,
-
I am working with a very large dataset (shape (32_128_560, 42)). I want to leverage the DifficultyEstimator, but it uses sklearn's NearestNeighbors, which is very slow.
I tried cuML's NearestNeighbors, which runs on the GPU: I simply made a copy of the DifficultyEstimator code and replaced the sklearn import with the cuML one, and it works flawlessly on the GPU.
Still, while this GPU version fits the data very quickly, computing the difficulty scores takes very long and had not returned after 10 minutes on the eval set (~2 million rows).
My code looks like this:
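(The original snippet did not survive the page export; below is an illustrative reconstruction of the swap described above, replacing sklearn's NearestNeighbors with cuML's GPU implementation in a copy of DifficultyEstimator, shown here only for the distance-based variant; `X_prop_train` and `X_test` are placeholders.)

```python
import numpy as np
from cuml.neighbors import NearestNeighbors  # GPU replacement for sklearn's

class DifficultyEstimatorGPU:
    """Distance-based difficulty: mean distance to the k nearest
    neighbours in the proper training set (one of the variants the
    original DifficultyEstimator supports)."""

    def fit(self, X, k=25):
        self.nn = NearestNeighbors(n_neighbors=k)
        self.nn.fit(X)
        return self

    def apply(self, X):
        # cuML mirrors the input array type (NumPy in, NumPy out), but
        # every query row is still matched against the full training set.
        distances, _ = self.nn.kneighbors(X)
        return np.asarray(distances).mean(axis=1)

de = DifficultyEstimatorGPU().fit(X_prop_train, k=25)
sigmas_test = de.apply(X_test)  # this call is the remaining bottleneck
```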
I guess my question is: how else could I accelerate this? My ultimate goal is to use a conformal predictive system, but I need to make the difficulty estimation faster first. I am open to implementing other difficulty estimators that could potentially be faster.
Thanks,