Fix model inefficiency #54
Conversation
On each request, all of the thousands of strings in the categories are processed. XGBoost does not store this in its memory, so it is very inefficient; LightGBM can deal with that. However, this should rather be handled on the side of the tool.
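For context, a minimal sketch of the kind of caching approach visible in the diffs below: build the category-to-code mappings once at model-load time and reuse them on every request, instead of re-processing the full category lists. The names `build_cat_encoders` and `encode_column` are hypothetical; the PR itself uses `self._cat_encoders`.

```python
import numpy as np
import pandas as pd

def build_cat_encoders(categories: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """Precompute string -> numeric code mappings, one dict per categorical feature."""
    return {
        feat: {value: float(code) for code, value in enumerate(values)}
        for feat, values in categories.items()
    }

def encode_column(col: pd.Series, encoder: dict[str, float]) -> np.ndarray:
    """Apply a cached mapping at request time; no category re-processing."""
    return col.map(encoder).to_numpy(dtype=np.float32)
```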
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses model inefficiency by overhauling the underlying model architecture and prediction mechanisms. It introduces a more streamlined approach for managing machine learning models, enhancing both training and inference processes. The changes aim to provide better control over model complexity, improve prediction speed, and offer new tools for developer interaction, ultimately leading to more robust and efficient model deployment.
Code Review
This pull request introduces a significant and valuable refactoring to improve model efficiency and reduce dependencies. Key changes include removing the scikit-learn TransformedTargetRegressor wrapper in favor of direct handling of target transformations, and implementing an optimized prediction path for XGBoost that bypasses pandas DataFrame overhead. The introduction of model size penalization in hyperparameter tuning is also a great addition for creating more efficient models. I've identified a critical issue with a user-specific file being added to the repository and a couple of medium-severity issues related to model size calculation and minor code cleanup. Overall, this is a high-quality contribution that improves the performance and maintainability of the model handling code.
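To illustrate the first of those changes, here is a minimal runnable sketch of handling the target transformation directly instead of wrapping the regressor in `TransformedTargetRegressor`; the `log1p`/`expm1` pair is an assumption for illustration, not necessarily the transform this PR uses.

```python
import numpy as np
from xgboost import XGBRegressor

# Synthetic data so the example stands alone.
rng = np.random.default_rng(0)
X_train = rng.random((100, 4))
y_train = rng.random(100) * 1000.0
X_test = rng.random((10, 4))

# Transform the target once before fitting...
y_train_t = np.log1p(y_train)
regressor = XGBRegressor(n_estimators=10)
regressor.fit(X_train, y_train_t)

# ...and invert the transform at prediction time.
y_pred = np.expm1(regressor.predict(X_test))
```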
```json
{
    "folders": [
        {
            "path": ".."
        },
        {
            "path": "../../copr"
        },
        {
            "path": "../../../../pagure/ansible"
        },
        {
            "path": "../../resalloc"
        }
    ],
    "settings": {}
}
```
This VSCode workspace file appears to be user-specific and should not be part of the repository. It contains relative paths like ../../copr which are unlikely to be valid for other developers. Please remove this file from the pull request and add *.code-workspace to the .gitignore file to prevent this from happening in the future.
```python
def _log_model_size(self, regressor: Any) -> None:
    size_bytes = sys.getsizeof(regressor)
    logger.info("Model size: %.2f MB", size_bytes / (1024 * 1024))
```
The use of sys.getsizeof(regressor) to determine the model size is likely inaccurate. For models from libraries like XGBoost or LightGBM, this function often returns only the size of the Python wrapper object, not the memory occupied by the actual model in the underlying C++ implementation. This can be misleading. A more accurate approach would be to measure the size of the model file on disk, which is saved just before this method is called.
Suggested change:

```python
def _log_model_size(self, regressor: Any) -> None:
    model_file = self._model_directory / self.native_model_filename
    size_bytes = model_file.stat().st_size
    logger.info("Model size on disk: %.2f MB", size_bytes / (1024 * 1024))
```
```python
        data[:, i] = col.map(
            self._cat_encoders[feat],
        ).to_numpy(dtype=np.float32)
    else:
        data[:, i] = col.to_numpy(dtype=np.float32)

dmatrix = self.xgb.DMatrix(
    data,
    feature_names=list(ALL_FEATURES),
    feature_types=self._feature_types,
```
There are a couple of minor code style issues in the `predict` method of `XGBoostModel` that could be cleaned up for better clarity and to remove minor inefficiencies:

- The trailing comma in `col.map(self._cat_encoders[feat],)` is unnecessary.
- The `list()` call in `feature_names=list(ALL_FEATURES)` is redundant, as `ALL_FEATURES` is already a list.
Suggested change:

```python
        data[:, i] = col.map(
            self._cat_encoders[feat]
        ).to_numpy(dtype=np.float32)
    else:
        data[:, i] = col.to_numpy(dtype=np.float32)

dmatrix = self.xgb.DMatrix(
    data,
    feature_names=ALL_FEATURES,
    feature_types=self._feature_types,
```
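For readers following along, here is a self-contained sketch of the optimized path the diff above implements: encode straight into a preallocated `float32` array and hand it to `DMatrix`, skipping per-request DataFrame construction. Feature names, encoders, and training data are made up for illustration, and the PR additionally passes `feature_types`, omitted here for brevity.

```python
import numpy as np
import pandas as pd
import xgboost as xgb

ALL_FEATURES = ["size", "arch"]
CAT_ENCODERS = {"arch": {"x86_64": 0.0, "aarch64": 1.0}}

def encode(df: pd.DataFrame) -> np.ndarray:
    # Fill a dense float32 array column by column, avoiding an
    # intermediate DataFrame copy on every request.
    data = np.empty((len(df), len(ALL_FEATURES)), dtype=np.float32)
    for i, feat in enumerate(ALL_FEATURES):
        col = df[feat]
        if feat in CAT_ENCODERS:
            data[:, i] = col.map(CAT_ENCODERS[feat]).to_numpy(dtype=np.float32)
        else:
            data[:, i] = col.to_numpy(dtype=np.float32)
    return data

# Train a tiny booster on synthetic data so the example is runnable.
train_df = pd.DataFrame({"size": [1.0, 2.0, 3.0, 4.0],
                         "arch": ["x86_64", "aarch64", "x86_64", "aarch64"]})
y = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
dtrain = xgb.DMatrix(encode(train_df), label=y, feature_names=ALL_FEATURES)
booster = xgb.train({"objective": "reg:squarederror"}, dtrain, num_boost_round=5)

# At request time only the cheap array fill and DMatrix wrap happen.
pred = booster.predict(xgb.DMatrix(encode(train_df), feature_names=ALL_FEATURES))
```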
whoa, this is still a proof of concept, but the service went from 8 GB RAM to 1 GB, and from 2 requests per second to 164 requests per second! it is now really lightweight, and all of that from a couple of lines of code :) (the rest is basically just removing features that are no longer needed, plus adding a nice debugging feature)
from 95% memcpy overhead to basically 0, so the problem is totally gone. Yes, there is some 16% Python overhead, but that is expected. The rest is mainly nginx or model-related work, exactly what we want.
Force-pushed from 86fb31d to 056f74a.
/gemini review
Code Review
This pull request introduces a significant and valuable refactoring of the model handling, training, and prediction logic to improve efficiency. Key improvements include removing the dependency on scikit-learn's TransformedTargetRegressor, optimizing the prediction path (especially for XGBoost), and introducing a model size penalty in hyperparameter tuning. The code quality is high, and the changes are consistent across the project. I've identified one high-severity issue in the training logic when early stopping is used, which could lead to suboptimal models. Please see the detailed comment.
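As a side note on the model size penalty mentioned above, one common shape such an objective can take is sketched below; the weight, search ranges, and data are illustrative, not taken from this PR.

```python
import pickle
import numpy as np
import optuna
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((400, 6))
y = X @ rng.random(6)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

SIZE_WEIGHT = 0.01  # illustrative trade-off between error and model size

def objective(trial: optuna.Trial) -> float:
    reg = xgb.XGBRegressor(
        n_estimators=trial.suggest_int("n_estimators", 50, 400),
        max_depth=trial.suggest_int("max_depth", 3, 10),
    )
    reg.fit(X_tr, y_tr)
    rmse = mean_squared_error(y_val, reg.predict(X_val)) ** 0.5
    size_mb = len(pickle.dumps(reg)) / (1024 * 1024)
    # Penalize large models so tuning prefers the smallest model
    # that reaches comparable accuracy.
    return rmse + SIZE_WEIGHT * size_mb

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
```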
```python
best_regressor = self.make_regressor(study.best_trial.params)
best_regressor.fit(X_train, y_train_t)
```
When refitting the best model after an Optuna study with early stopping enabled, the _...EarlyStopping wrapper class is used. This class's fit method splits the provided training data into a new training and validation set. Consequently, the final model is trained on only a subset of the X_train data, which is suboptimal. The final model should be trained on the entire training dataset (X_train, y_train_t) to maximize its performance.
A common practice is to:

- During the Optuna trial, capture the optimal number of boosting rounds (`best_iteration`) from the regressor when early stopping is triggered. This can be stored as a user attribute on the trial.
- For the final refit, create a new instance of the base regressor (without the early stopping wrapper).
- Set its `n_estimators` parameter to the captured `best_iteration`.
- Fit this new regressor on the complete `X_train` and `y_train_t` datasets.

This would require some refactoring to separate the creation of the 'tuning' regressor from the 'final' regressor (see the sketch below).
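A minimal runnable sketch of that refit pattern (assuming XGBoost >= 1.6, where `early_stopping_rounds` is a constructor parameter; data and search ranges are synthetic, not from this PR):

```python
import numpy as np
import optuna
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the PR's X_train / y_train_t.
rng = np.random.default_rng(0)
X_train = rng.random((500, 8))
y_train_t = X_train @ rng.random(8) + rng.normal(scale=0.1, size=500)

def objective(trial: optuna.Trial) -> float:
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_train, y_train_t, test_size=0.2, random_state=0
    )
    reg = xgb.XGBRegressor(
        n_estimators=1000,  # upper bound; early stopping picks fewer rounds
        learning_rate=trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        max_depth=trial.suggest_int("max_depth", 3, 10),
        early_stopping_rounds=20,
    )
    reg.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
    # Record how many boosting rounds early stopping actually kept.
    trial.set_user_attr("best_iteration", reg.best_iteration)
    return reg.best_score

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=10)

# Refit on the FULL training set, without early stopping, using the
# round count captured during tuning (+1: best_iteration is 0-indexed).
best_params = dict(study.best_trial.params)
best_params["n_estimators"] = study.best_trial.user_attrs["best_iteration"] + 1
final_regressor = xgb.XGBRegressor(**best_params)
final_regressor.fit(X_train, y_train_t)
```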
Fix #45