Fix model inefficiency #54
Conversation
On each request, all of the thousands of strings in the categories are processed. XGBoost does not store this in its memory, so it is very inefficient; LightGBM can deal with that. However, this should rather be handled on the side of the tool.
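For context, a minimal sketch of the kind of caching approach visible in the diffs below: build the category-to-code mappings once at model-load time and reuse them on every request, instead of re-processing the full category lists. The names `build_cat_encoders` and `encode_column` are hypothetical; the PR itself uses `self._cat_encoders`.

```python
import numpy as np
import pandas as pd

def build_cat_encoders(categories: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """Precompute string -> numeric code mappings, one dict per categorical feature."""
    return {
        feat: {value: float(code) for code, value in enumerate(values)}
        for feat, values in categories.items()
    }

def encode_column(col: pd.Series, encoder: dict[str, float]) -> np.ndarray:
    """Apply a cached mapping at request time; no category re-processing."""
    return col.map(encoder).to_numpy(dtype=np.float32)
```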
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses model inefficiency by overhauling the underlying model architecture and prediction mechanisms. It introduces a more streamlined approach for managing machine learning models, enhancing both training and inference processes. The changes aim to provide better control over model complexity, improve prediction speed, and offer new tools for developer interaction, ultimately leading to more robust and efficient model deployment.
Code Review
This pull request introduces a significant and valuable refactoring to improve model efficiency and reduce dependencies. Key changes include removing the scikit-learn TransformedTargetRegressor wrapper in favor of direct handling of target transformations, and implementing an optimized prediction path for XGBoost that bypasses pandas DataFrame overhead. The introduction of model size penalization in hyperparameter tuning is also a great addition for creating more efficient models. I've identified a critical issue with a user-specific file being added to the repository and a couple of medium-severity issues related to model size calculation and minor code cleanup. Overall, this is a high-quality contribution that improves the performance and maintainability of the model handling code.
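To illustrate the first of those changes, here is a minimal runnable sketch of handling the target transformation directly instead of wrapping the regressor in `TransformedTargetRegressor`; the `log1p`/`expm1` pair is an assumption for illustration, not necessarily the transform this PR uses.

```python
import numpy as np
from xgboost import XGBRegressor

# Synthetic data so the example stands alone.
rng = np.random.default_rng(0)
X_train = rng.random((100, 4))
y_train = rng.random(100) * 1000.0
X_test = rng.random((10, 4))

# Transform the target once before fitting...
y_train_t = np.log1p(y_train)
regressor = XGBRegressor(n_estimators=10)
regressor.fit(X_train, y_train_t)

# ...and invert the transform at prediction time.
y_pred = np.expm1(regressor.predict(X_test))
```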
```json
{
    "folders": [
        {
            "path": ".."
        },
        {
            "path": "../../copr"
        },
        {
            "path": "../../../../pagure/ansible"
        },
        {
            "path": "../../resalloc"
        }
    ],
    "settings": {}
}
```
This VSCode workspace file appears to be user-specific and should not be part of the repository. It contains relative paths like ../../copr which are unlikely to be valid for other developers. Please remove this file from the pull request and add *.code-workspace to the .gitignore file to prevent this from happening in the future.
```python
def _log_model_size(self, regressor: Any) -> None:
    size_bytes = sys.getsizeof(regressor)
    logger.info("Model size: %.2f MB", size_bytes / (1024 * 1024))
```
The use of sys.getsizeof(regressor) to determine the model size is likely inaccurate. For models from libraries like XGBoost or LightGBM, this function often returns only the size of the Python wrapper object, not the memory occupied by the actual model in the underlying C++ implementation. This can be misleading. A more accurate approach would be to measure the size of the model file on disk, which is saved just before this method is called.
Suggested change:

```python
def _log_model_size(self, regressor: Any) -> None:
    model_file = self._model_directory / self.native_model_filename
    size_bytes = model_file.stat().st_size
    logger.info("Model size on disk: %.2f MB", size_bytes / (1024 * 1024))
```
```python
        data[:, i] = col.map(
            self._cat_encoders[feat],
        ).to_numpy(dtype=np.float32)
    else:
        data[:, i] = col.to_numpy(dtype=np.float32)

dmatrix = self.xgb.DMatrix(
    data,
    feature_names=list(ALL_FEATURES),
    feature_types=self._feature_types,
```
There are a couple of minor code style issues in the `predict` method of `XGBoostModel` that could be cleaned up for better clarity and to remove minor inefficiencies:

- The trailing comma in `col.map(self._cat_encoders[feat],)` is unnecessary.
- The `list()` call in `feature_names=list(ALL_FEATURES)` is redundant, as `ALL_FEATURES` is already a list.
Suggested change:

```python
        data[:, i] = col.map(
            self._cat_encoders[feat]
        ).to_numpy(dtype=np.float32)
    else:
        data[:, i] = col.to_numpy(dtype=np.float32)

dmatrix = self.xgb.DMatrix(
    data,
    feature_names=ALL_FEATURES,
    feature_types=self._feature_types,
```
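For readers following along, here is a self-contained sketch of the optimized path the diff above implements: encode straight into a preallocated `float32` array and hand it to `DMatrix`, skipping per-request DataFrame construction. Feature names, encoders, and training data are made up for illustration, and the PR additionally passes `feature_types`, omitted here for brevity.

```python
import numpy as np
import pandas as pd
import xgboost as xgb

ALL_FEATURES = ["size", "arch"]
CAT_ENCODERS = {"arch": {"x86_64": 0.0, "aarch64": 1.0}}

def encode(df: pd.DataFrame) -> np.ndarray:
    # Fill a dense float32 array column by column, avoiding an
    # intermediate DataFrame copy on every request.
    data = np.empty((len(df), len(ALL_FEATURES)), dtype=np.float32)
    for i, feat in enumerate(ALL_FEATURES):
        col = df[feat]
        if feat in CAT_ENCODERS:
            data[:, i] = col.map(CAT_ENCODERS[feat]).to_numpy(dtype=np.float32)
        else:
            data[:, i] = col.to_numpy(dtype=np.float32)
    return data

# Train a tiny booster on synthetic data so the example is runnable.
train_df = pd.DataFrame({"size": [1.0, 2.0, 3.0, 4.0],
                         "arch": ["x86_64", "aarch64", "x86_64", "aarch64"]})
y = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
dtrain = xgb.DMatrix(encode(train_df), label=y, feature_names=ALL_FEATURES)
booster = xgb.train({"objective": "reg:squarederror"}, dtrain, num_boost_round=5)

# At request time only the cheap array fill and DMatrix wrap happen.
pred = booster.predict(xgb.DMatrix(encode(train_df), feature_names=ALL_FEATURES))
```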
whoa, this is still a proof of concept, but the service went from 8 GB RAM to 1 GB, and from 2 requests per second to 164 requests per second! it is now really lightweight, and all of that from a couple of lines of code :) (the rest is basically just removing features that are no longer needed, plus adding a nice debugging feature)
from 95% memcpy overhead to basically 0, so the problem is totally gone. Yes, there is some 16% Python overhead, but that is expected. The rest is mainly nginx or model-related work, exactly what we want.
Force-pushed from 86fb31d to 056f74a.
/gemini review
Code Review
This pull request introduces a significant and valuable refactoring of the model handling, training, and prediction logic to improve efficiency. Key improvements include removing the dependency on scikit-learn's TransformedTargetRegressor, optimizing the prediction path (especially for XGBoost), and introducing a model size penalty in hyperparameter tuning. The code quality is high, and the changes are consistent across the project. I've identified one high-severity issue in the training logic when early stopping is used, which could lead to suboptimal models. Please see the detailed comment.
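As a side note on the model size penalty mentioned above, one common shape such an objective can take is sketched below; the weight, search ranges, and data are illustrative, not taken from this PR.

```python
import pickle
import numpy as np
import optuna
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((400, 6))
y = X @ rng.random(6)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

SIZE_WEIGHT = 0.01  # illustrative trade-off between error and model size

def objective(trial: optuna.Trial) -> float:
    reg = xgb.XGBRegressor(
        n_estimators=trial.suggest_int("n_estimators", 50, 400),
        max_depth=trial.suggest_int("max_depth", 3, 10),
    )
    reg.fit(X_tr, y_tr)
    rmse = mean_squared_error(y_val, reg.predict(X_val)) ** 0.5
    size_mb = len(pickle.dumps(reg)) / (1024 * 1024)
    # Penalize large models so tuning prefers the smallest model
    # that reaches comparable accuracy.
    return rmse + SIZE_WEIGHT * size_mb

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
```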
```python
best_regressor = self.make_regressor(study.best_trial.params)
best_regressor.fit(X_train, y_train_t)
```
When refitting the best model after an Optuna study with early stopping enabled, the _...EarlyStopping wrapper class is used. This class's fit method splits the provided training data into a new training and validation set. Consequently, the final model is trained on only a subset of the X_train data, which is suboptimal. The final model should be trained on the entire training dataset (X_train, y_train_t) to maximize its performance.
A common practice is to:

- During the Optuna trial, capture the optimal number of boosting rounds (`best_iteration`) from the regressor when early stopping is triggered. This can be stored as a user attribute on the trial.
- For the final refit, create a new instance of the base regressor (without the early stopping wrapper).
- Set its `n_estimators` parameter to the captured `best_iteration`.
- Fit this new regressor on the complete `X_train` and `y_train_t` datasets.

This would require some refactoring to separate the creation of the 'tuning' regressor from the 'final' regressor (see the sketch below).
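A minimal runnable sketch of that refit pattern (assuming XGBoost >= 1.6, where `early_stopping_rounds` is a constructor parameter; data and search ranges are synthetic, not from this PR):

```python
import numpy as np
import optuna
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the PR's X_train / y_train_t.
rng = np.random.default_rng(0)
X_train = rng.random((500, 8))
y_train_t = X_train @ rng.random(8) + rng.normal(scale=0.1, size=500)

def objective(trial: optuna.Trial) -> float:
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_train, y_train_t, test_size=0.2, random_state=0
    )
    reg = xgb.XGBRegressor(
        n_estimators=1000,  # upper bound; early stopping picks fewer rounds
        learning_rate=trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        max_depth=trial.suggest_int("max_depth", 3, 10),
        early_stopping_rounds=20,
    )
    reg.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
    # Record how many boosting rounds early stopping actually kept.
    trial.set_user_attr("best_iteration", reg.best_iteration)
    return reg.best_score

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=10)

# Refit on the FULL training set, without early stopping, using the
# round count captured during tuning (+1: best_iteration is 0-indexed).
best_params = dict(study.best_trial.params)
best_params["n_estimators"] = study.best_trial.user_attrs["best_iteration"] + 1
final_regressor = xgb.XGBRegressor(**best_params)
final_regressor.fit(X_train, y_train_t)
```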
Fix #45