All notable changes to this project will be documented in this file.
- Fixed `predict_model` throwing an exception with loaded pipelines (pycaret#2349)
- Fixed potential parameter leaking for `ParallelBackend` - thanks to @goodwanghan (pycaret#2339)
- Refactored a piece of logic in arules - thanks to @daikikatsuragawa (pycaret#2316)
- Added Two Tutorials in Chinese - thanks to @ryanxjhan (pycaret#2352)
- Added CLF101 in Chinese - thanks to @ryanxjhan (pycaret#2353)
- Added new tutorials in Chinese - thanks to @ryanxjhan (pycaret#2375)
- Made `log_experiment` more configurable (pycaret#2334, pycaret#2335)
- Made `return_train_score=False` use the old output format (pycaret#2333)
- Fixed `dashboard_logger` key error during `setup` (pycaret#2311)
- Fugue integration - thanks to @goodwanghan (pycaret#2035)
- Added W&B experiment logger - thanks to @AyushExel (pycaret#2231)
- Fixed `check_fairness` exception when the index is not an ordinal number - thanks to @reza1615 (pycaret#2055)
- Unsupported characters in dataframes are now replaced - thanks to @reza1615 (pycaret#2058)
- Fixed drift report with categorical columns - thanks to @reza1615 (pycaret#2063)
- Added multivariable time series dataset from UCI - thanks to @reza1615 (pycaret#2094)
- Fixed a UTF error during installation - thanks to @reza1615 (pycaret#2113)
- MLFlow tracking API can now take in custom tags - thanks to @netoferraz (pycaret#1526)
- Updated `create_api` function (pycaret#2146)
- `drift_report` can now work with unseen data - thanks to @reza1615 (pycaret#2183)
- Added Japanese tutorial - thanks to @hanaseleb (pycaret#2215)
- Added Traffic and Drugs Related Violations dataset and example - thanks to @HaithemH (pycaret#2191)
- Train score can now be returned from various supervised learning functions (`return_train_score=True`). Passing an unseen dataset with the label column to `predict_model` will now calculate the metrics for that dataset - thanks to @levelalphaone (pycaret#2237)
- Fixed spelling mistakes in function docstrings - thanks to @aadarshsingh191198 (pycaret#2269)
- Pinned `numba<0.55` (pycaret#2056)
- Added new function `create_app` (pycaret#2044)
- Refactored `optimize_threshold` function (pycaret#2041)
- Added new function `create_docker` (pycaret#2005)
- Added new function `create_api` (pycaret#2000)
- Added new function `check_fairness` (pycaret#1997)
- Added new function `eda` (pycaret#1983)
- Added new function `convert_model` (pycaret#1959)
- Added an ability to pass kwargs to plots in `plot_model` (https://github.com/pycaret/pycaret/pull/19400)
- Added `drift_report` functionality to `predict_model` (pycaret#1935)
- Added new function `create_dashboard` (pycaret#1925)
- Added `grid_interval` parameter to `optimize_threshold` - thanks to @wolfryu (pycaret#1938)
- Made logging level configurable by environment variable (pycaret#2026)
- Made the optional path in AWS configurable (pycaret#2045)
- Fixed TSNE plot with PCA (pycaret#2032)
- Fixed rendering of streamlit plots (pycaret#2008)
- Fixed class names in `tree` plot - thanks to @yamasakih (pycaret#1982)
- Fixed NearZeroVariance preprocessor not being configurable - thanks to @Flyfoxs (pycaret#1952)
- Removed duplicated code - thanks to @Flyfoxs (pycaret#1882)
- Documentation improvements - thanks to @harsh204016, @khrapovs (https://github.com/pycaret/pycaret/pull/1931/files, pycaret#1956, pycaret#1946, pycaret#1949)
- Pinned `pyyaml<6.0.0` to fix issues with Google Colab
- Fixed an issue where `Fix_multicollinearity` would fail if the target was a float (pycaret#1640)
- MLFlow runs are now nested - thanks to @jfagn (pycaret#1660)
- Fixed a typo in REG102 tutorial - thanks to @bobo-jamson (pycaret#1684)
- Fixed `interpret_model` not always respecting `save_path` (pycaret#1707)
- Fixed certain plots not being logged by MLFlow (pycaret#1769)
- Added dummy models to set a baseline in `compare_models` - thanks to @reza1615 (pycaret#1739)
- Improved error message if a column specified in `ignore_features` doesn't exist in the dataset - thanks to @reza1615 (pycaret#1793)
- Added an ability to set a custom probability threshold for binary classification through the `probability_threshold` argument in various methods (pycaret#1858)
- Separated internal CV from validation CV for `stack_models` and `calibrate_models` (pycaret#1849, pycaret#1858)
- A `RuntimeError` will now be raised if an incorrect version of `scikit-learn` is installed (pycaret#1870)
- Improved readme, documentation and repository structure
- Unpinned `numba` (pycaret#1735)
- Added `get_leaderboard` function for classification and regression modules
- It is now possible to specify the plot save path with the save argument of `plot_model` and `interpret_model` - thanks to @bhanuteja2001 (pycaret#1537)
- Fixed `interpret_model` affecting `plot_model` behavior - thanks to @naujgf (pycaret#1600)
- Fixed issues with conda builds - thanks to @melonhead901 (pycaret#1479)
- Documentation improvements - thanks to @caron14 and @harsh204016 (pycaret#1499, pycaret#1502)
- Fixed `blend_models` and `stack_models` throwing an exception when using custom estimators (pycaret#1500)
- Fixed a "Target Missing" issue with "Remove Multicollinearity" option (pycaret#1508)
- `errors="ignore"` parameter for `compare_models` now correctly ignores errors during full fit (pycaret#1510)
- Fixed certain data types being incorrectly encoded as int64 during setup (pycaret#1515)
- Pinned `numba<0.54` (pycaret#1530)
- Fixed issues with `[full]` install by pinning `interpret<=0.2.4`
- Added support for S3 folder path in `deploy_model()` with AWS
- Enabled experimental Optuna `TPESampler` options to improve convergence (in `tune_model()`)
- Implemented PDP, MSA and PFI plots in `interpret_model` - thanks to @IncubatorShokuhou (pycaret#1415)
- Implemented Kolmogorov-Smirnov (KS) plot in `plot_model` under `pycaret.classification` module
- Fixed a typo "RVF" to "RBF" - thanks to @baturayo (pycaret#1220)
- Readme & license updates and improvements
- Fixed `remove_multicollinearity` considering categorical features
- Fixed keyword issues with PyCaret's cuML wrappers
- Improved performance of iterative imputation
- Fixed `gain` and `lift` plots taking wrong arguments, creating misleading plots
- `interpret_model` on LightGBM will now show a beeswarm plot
- Multiple improvements to exception handling and documentation in `pycaret.persistence` (pycaret#1324)
- `remove_perfect_collinearity` option will now be shown in the `setup()` summary - thanks to @mjkanji (pycaret#1342)
- Fixed `IterativeImputer` setting wrong float precision
- Fixed custom grids in `tune_model` raising an exception when composed of lists
- Improved documentation in `pycaret.clustering` - thanks to @susmitpy (pycaret#1372)
- Added support for LightGBM CUDA version - thanks to @IncubatorShokuhou (pycaret#1396)
- Exposed `address` in `get_data` for alternative data sources - thanks to @IncubatorShokuhou (pycaret#1416)
- Fixed an exception with missing variables (display_container etc.) during load_config()
- Fixed exceptions when using Ridge and RF estimators with cuML (GPU mode)
- Fixed PyCaret's cuML wrappers not being pickleable
- Added an extra check to get_all_object_vars_and_properties internal method, fixing exceptions with certain estimators
- save_model() now supports kwargs, which will be passed to joblib.dump()
- Fixed an issue with load_model() from AWS (duplicate .pkl extension) - thanks to markgrujic (pycaret#1128)
- Fixed a typo in documentation - thanks to koorukuroo (pycaret#1149)
- Optimized Fix_multicollinearity transformer, drastically reducing the size of saved pipeline
- interpret_model() now supports data passed as an argument - thanks to jbechtel (pycaret#1184)
- Removed `infer_signature` from MLflow logging when `log_experiment=True`.
- Fixed a rare issue where binary_multiclass_score_func was not pickleable
- Fixed edge case exceptions in feature selection
- Fixed an exception with `finalize_model` when using GroupKFold CV
- Pinned `mlxtend>=0.17.0`, `imbalanced-learn==0.7.0`, and `gensim<4.0.0`
- Modules Impacted: `pycaret.classification` `pycaret.regression` `pycaret.clustering` `pycaret.anomaly` `pycaret.arules`
- Added new interactive residual plots in `pycaret.regression` module. You can now generate interactive residual plots by using `residuals_interactive` in the `plot_model` function.
- Added plot rendering support for streamlit applications. A new parameter `display_format` is added in the `plot_model` function. To render a plot in a streamlit app, set this to `streamlit`.
- Revamped Boruta feature selection algorithm (give it a try!).
- `tune_model` in `pycaret.classification` and `pycaret.regression` is now compatible with custom models.
- Added low_memory and max_len support to association rules module (pycaret#1008).
- Increased robustness of DataFrame checks (pycaret#1005).
- Improved loading of models from AWS (pycaret#1005).
- Catboost and XGBoost are now optional dependencies. They are not automatically installed with the default slim installation. To install optional dependencies use `pip install pycaret[full]`.
- Added `raw_score` argument in the `predict_model` function for `pycaret.classification` module. When set to True, scores for each class will be returned separately.
- PyCaret now returns base scikit-learn objects, whenever possible.
- When `handle_unknown_categorical` is set to False in the `setup` function, an exception will be raised during prediction if the data contains unknown levels in categorical features.
- `predict_model` for multiclass classification now returns labels as an integer.
- Fixed an edge case where an IndexError would be raised in `pycaret.clustering` and `pycaret.anomaly`.
- Fixed text formatting for certain plots in `pycaret.classification` and `pycaret.regression`.
- If a `logs.log` file cannot be created when `setup` is initialized, no exception will be raised now (support for more configurable logging to come in future).
- User added metrics will not raise exceptions now and instead return 0.0.
- Compatibility with tune-sklearn>=0.2.0.
- Fixed an edge case for dropping NaNs in target column.
- Fixed stacked models not being tuned correctly.
- Fixed an exception with KFold when fold_shuffle=False.
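The entry above about user-added metrics returning 0.0 instead of raising can be pictured as a wrapper around the scoring function. The following is only a sketch of that pattern under stated assumptions: `safe_metric` and `accuracy` are hypothetical names written for illustration, not PyCaret functions, and PyCaret's internal handling may differ.

```python
import functools
from typing import Callable, Sequence


def safe_metric(score_func: Callable[..., float]) -> Callable[..., float]:
    """Wrap a metric so that any exception yields 0.0 instead of propagating."""

    @functools.wraps(score_func)
    def wrapper(y_true: Sequence, y_pred: Sequence, **kwargs) -> float:
        try:
            return float(score_func(y_true, y_pred, **kwargs))
        except Exception:
            # Hypothetical fallback mirroring the changelog entry: score as 0.0
            return 0.0

    return wrapper


@safe_metric
def accuracy(y_true, y_pred):
    # Raises ZeroDivisionError on empty input
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


ok_score = accuracy([1, 0, 1, 1], [1, 0, 0, 1])  # three of four labels match
failed_score = accuracy([], [])                  # the error is swallowed
```

Swallowing every exception keeps a long comparison run alive, at the cost of hiding genuine bugs in the metric, so a score of exactly 0.0 is worth a second look.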
Release: PyCaret 2.2.3 | Release Date: December 22, 2020 (SEVERAL BUGS FIX | CRITICAL COMPATIBILITY FIX)
- Fixed exceptions with the `predict_model` function when data columns had non-string characters.
- Fixed a rare exception with the `remove_multicollinearity` parameter in the `setup` function.
- Improved performance and robustness of conversion of date features to categoricals.
- Fixed an exception with the `models` function when the `type` parameter was passed.
- The data frame displayed after setup can now be accessed with the `pull` function.
- Fixed an exception with save_config
- Fixed a rare case where the target column would be treated as an ID column and thus dropped.
- SHAP plots can now be saved (pass save parameter as True)
- | CRITICAL | Compatibility broke for catboost, pyod (other impacts unknown as of now) with sklearn=0.24 (released on Dec 22, 2020). A temporary fix is requiring 0.23.2 specifically in the `requirements.txt`.
- Fixed an issue with the `optimize_threshold` function in the `pycaret.classification` module. It now returns a float instead of an array.
- Fixed issue with the `predict_model` function. It now uses the original data frame to append the predictions. As such, any extra columns given at the time of inference are not removed when returning the predictions. Instead, they are internally ignored at the time of prediction.
- Fixed edge case exceptions for the `create_model` function in `pycaret.clustering`.
- Fixed exceptions when column names are not strings.
- Fixed exceptions in `pycaret.regression` when `transform_target` is True in the `setup` function.
- Fixed an exception in the `models` function if the `type` parameter is specified.
Post-release 2.2, the following issues have been fixed:
- Fixed `plot_model = 'tree'` exceptions.
- Fixed issue with `predict_model` causing errors with non-contiguous indices.
- Fixed issue with `remove_outliers` parameter in the `setup` function. It was introducing extra columns in training data.
- Fixed issue with `plot_model` in `pycaret.clustering` causing errors with non-contiguous indices.
- Fixed an exception when the model was saved or logged when `imputation_type` is set to 'iterative' in the `setup` function.
- `compare_models` now prints intermediate output when `html=False`.
- Metrics in `pycaret.classification` for binary classification are now calculated with `average='binary'`. Before, they were a weighted average of the positive and negative class; now they are calculated for the positive class only. For multiclass classification, `average='weighted'` is used.
- `optimize_threshold` now returns optimized probability threshold value as numpy object.
- Fixed issue with certain exceptions in `compare_models`.
- Added `profile_kwargs` argument in the `setup` function to pass keyword arguments to Pandas Profiler.
- `plot_model`, `interpret_model`, and `evaluate_model` now accept a new parameter `use_train_data` which, when set to True, generates the plot on train data instead of test data.
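To make the difference between `average='binary'` and `average='weighted'` concrete, here is a small plain-Python sketch for precision. The helper names are hypothetical and this is not how PyCaret or scikit-learn compute the metric internally; it only illustrates the two averaging modes on toy labels.

```python
from collections import Counter
from typing import Hashable, Sequence


def precision_for_class(y_true: Sequence[Hashable], y_pred: Sequence[Hashable], label: Hashable) -> float:
    """Precision when `label` is treated as the positive class."""
    predicted = sum(1 for p in y_pred if p == label)
    if predicted == 0:
        return 0.0
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == p == label)
    return true_pos / predicted


def weighted_precision(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Per-class precision averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(
        precision_for_class(y_true, y_pred, label) * count / total
        for label, count in support.items()
    )


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

binary = precision_for_class(y_true, y_pred, label=1)  # positive class only
weighted = weighted_precision(y_true, y_pred)          # both classes, support-weighted
```

On this imbalanced toy data the weighted figure is pulled up by the majority negative class, which is why the two modes can report noticeably different numbers for the same predictions.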
- Modules Impacted: `pycaret.classification` `pycaret.regression` `pycaret.clustering` `pycaret.anomaly`
- Separate Train and Test Set: New parameter `test_data` has been added in the `setup` function of `pycaret.classification` and `pycaret.regression`. When a DataFrame is passed into `test_data`, it is used as a holdout set and the `train_size` parameter is ignored. `test_data` must be labeled and the shape of `test_data` must match the shape of `data`.
- Disable Default Preprocessing: A new parameter `preprocess` has been added into the `setup` function. When `preprocess` is set to `False`, no transformations are applied except for `train_test_split` and custom transformations passed in the `custom_pipeline` param. Data must be ready for modeling (no missing values, no dates, categorical data encoding) when preprocess is set to False.
- Custom Metrics: New functions `get_metrics`, `add_metric` and `remove_metric` are now added in `pycaret.classification`, `pycaret.regression`, and `pycaret.clustering`, that can be used to add / remove metrics used in model evaluation.
- Custom Transformations: A new parameter `custom_pipeline` has been added into the `setup` function. It takes a tuple of `(str, transformer)` or a list of tuples. When passed, it will append the custom transformers in the preprocessing pipeline and are applied on each CV fold separately and on the final fit. All the custom transformations are applied after `train_test_split` and before pycaret's internal transformations.
- GPU enabled Training: To use GPU for training, the `use_gpu` parameter in the `setup` function can be set to `True` or `force`. When set to True, it will use GPU with algorithms that support it and fall back on CPU for the remaining ones. When set to `force`, it will only use GPU-enabled algorithms and raise exceptions if they are unavailable for use. The following algorithms are supported on GPU:
  - Extreme Gradient Boosting `pycaret.classification` `pycaret.regression`
  - LightGBM `pycaret.classification` `pycaret.regression`
  - CatBoost `pycaret.classification` `pycaret.regression`
  - Random Forest `pycaret.classification` `pycaret.regression`
  - K-Nearest Neighbors `pycaret.classification` `pycaret.regression`
  - Support Vector Machine `pycaret.classification` `pycaret.regression`
  - Logistic Regression `pycaret.classification`
  - Ridge Classifier `pycaret.classification`
  - Linear Regression `pycaret.regression`
  - Lasso Regression `pycaret.regression`
  - Ridge Regression `pycaret.regression`
  - Elastic Net (Regression) `pycaret.regression`
  - K-Means `pycaret.clustering`
  - Density-Based Spatial Clustering `pycaret.clustering`
- Hyperparameter Tuning: New methods for hyperparameter tuning have been added in the `tune_model` function for `pycaret.classification` and `pycaret.regression`. New parameters `search_library` and `search_algorithm` are added in the `tune_model` function. `search_library` can be `scikit-learn`, `scikit-optimize`, `tune-sklearn`, or `optuna`. The `search_algorithm` param can take the following values based on its `search_library`:
  - scikit-learn: `random` `grid`
  - scikit-optimize: `bayesian`
  - tune-sklearn: `random` `grid` `bayesian` `hyperopt` `bohb`
  - optuna: `random` `tpe`

  Except for `scikit-learn`, all the other search libraries are not hard dependencies of pycaret and must be installed separately.
- Early Stopping: Early stopping is now supported for hyperparameter tuning. A new parameter `early_stopping` is added in the `tune_model` function for `pycaret.classification` and `pycaret.regression`. It is ignored when `search_library` is `scikit-learn`, or if the estimator doesn't have a 'partial_fit' attribute. It can be either an object accepted by the search library or one of the following:
  - `asha` for Asynchronous Successive Halving Algorithm
  - `hyperband` for Hyperband
  - `median` for median stopping rule
  - When `False` or `None`, early stopping will not be used.
- Iterative Imputation: Iterative imputation type for numeric and categorical missing values is now implemented. New parameters `imputation_type`, `iterative_imputation_iters`, `categorical_iterative_imputer`, and `numeric_iterative_imputer` are added in the `setup` function. Read the blog post for more details: https://www.linkedin.com/pulse/iterative-imputation-pycaret-22-antoni-baum/?trackingId=Shg1zF%2F%2FR5BE7XFpzfTHkA%3D%3D
- New Plots: Following new plots have been added:
  - lift `pycaret.classification`
  - gain `pycaret.classification`
  - tree `pycaret.classification` `pycaret.regression`
  - feature_all `pycaret.classification` `pycaret.regression`
- CatBoost Compatibility: `CatBoostClassifier` and `CatBoostRegressor` are now compatible with `plot_model`. It requires `catboost>=0.23.2`.
- Log Plots in MLFlow Server: You can now log any plot in the `MLFlow` tracking server that is available in the `plot_model` function. To log specific plots, pass a list containing plot IDs in the `log_plots` parameter. Check the documentation of `plot_model` to see all available plots.
- Data Split Stratification: A new parameter `data_split_stratify` is added in the `setup` function of `pycaret.classification` and `pycaret.regression`. It controls stratification during `train_test_split`. When set to True, will stratify by target column. To stratify on any other columns, pass a list of column names.
- Fold Strategy: A new parameter `fold_strategy` is added in the `setup` function for `pycaret.classification` and `pycaret.regression`. By default, it is 'stratifiedkfold' for `pycaret.classification` and 'kfold' for `pycaret.regression`. Possible values are:
  - `kfold` for KFold CV;
  - `stratifiedkfold` for Stratified KFold CV;
  - `groupkfold` for Group KFold CV;
  - `timeseries` for TimeSeriesSplit CV; or
  - a custom CV generator object compatible with scikit-learn.
- Global Fold Parameter: A new parameter `fold` has been added in the `setup` function for `pycaret.classification` and `pycaret.regression`. It controls the number of folds to be used in cross validation. This is a global setting that can be overridden at the function level by using the `fold` parameter within each function. Ignored when `fold_strategy` is a custom object.
- Fold Groups: Optional group labels when `fold_strategy` is `groupkfold`. It takes an array with shape (n_samples, ) where n_samples is the number of rows in the training dataset. When a string is passed, it is interpreted as the column name in the dataset containing the group label.
- Transformation Pipeline: All transformations are now applied after `train_test_split`.
- Data Type Handling: All data types handling internally has been changed from `int64` and `float64` to `int32` and `float32` respectively in order to improve memory usage and performance, as well as for better compatibility with GPU-based algorithms.
- AutoML Behavior Change: The `automl` function in `pycaret.classification` and `pycaret.regression` no longer re-fits the model on the entire dataset. As such, if the model needs to be fitted on the entire dataset including the holdout set, `finalize_model` must be explicitly used.
- Default Tuning Grid: Default hyperparameter tuning grid for `RandomForest`, `XGBoost`, `CatBoost`, and `LightGBM` has been amended to remove extreme values for `max_depth` and other training intense parameters to speed up the tuning process.
- Random Forest Default Values: Default value of `n_estimators` for `RandomForestClassifier` and `RandomForestRegressor` has been changed from `10` to `100` to make it consistent with the default behavior of `scikit-learn`.
- AUC for Multiclass Classification: AUC for Multiclass target is now available in the metric evaluation.
- Google Colab Display: All output printed on screen (information grid, score grids) is now format compatible with Google Colab resulting in semantic improvements.
- Sampling Parameter Removed: The `sampling` parameter is now removed from the `setup` function of `pycaret.classification` and `pycaret.regression`.
- Type Hinting: In order to make both the usage and development easier, type hints have been added to all updated pycaret functions, in accordance with best practices. Users can leverage those by using an IDE with support for type hints.
- Documentation: All Modules documentation on the website is now retired. Updated documentation is available here: https://pycaret.readthedocs.io/en/latest/
- get_metrics: Returns table of available metrics used for CV. `pycaret.classification` `pycaret.regression` `pycaret.clustering`
- add_metric: Adds a custom metric for model evaluation. `pycaret.classification` `pycaret.regression` `pycaret.clustering`
- remove_metric: Removes custom metrics. `pycaret.classification` `pycaret.regression` `pycaret.clustering`
- save_config: Saves all global variables to a pickle file, allowing to later resume without rerunning the `setup` function. `pycaret.classification` `pycaret.regression` `pycaret.clustering` `pycaret.anomaly`
- load_config: Loads global variables from pickle file into Python environment. `pycaret.classification` `pycaret.regression` `pycaret.clustering` `pycaret.anomaly`
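The `save_config` / `load_config` pair above is described as a pickle round-trip of global variables. A minimal sketch of that idea with the standard library is shown here; `save_globals`, `load_globals` and the example dictionary are hypothetical stand-ins, and what PyCaret actually stores in its pickle file is not specified in this changelog.

```python
import os
import pickle
import tempfile
from typing import Any, Dict


def save_globals(config: Dict[str, Any], path: str) -> None:
    """Serialize a dictionary of settings to a pickle file."""
    with open(path, "wb") as handle:
        pickle.dump(config, handle)


def load_globals(path: str) -> Dict[str, Any]:
    """Read the settings back. Only unpickle files from a trusted source."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Hypothetical experiment state; the real set of variables is an assumption.
experiment_state = {"seed": 123, "fold": 10, "target": "label"}

with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = os.path.join(tmp_dir, "config.pkl")
    save_globals(experiment_state, config_path)
    restored = load_globals(config_path)
```

The restored object is an equal copy rather than the same object, which is what allows a later session to resume without rerunning the setup step.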
pycaret.classification pycaret.regression pycaret.clustering pycaret.anomaly
Following new parameters have been added:
- test_data: pandas.DataFrame, default = None
  If not None, test_data is used as a hold-out set, and the `train_size` parameter is ignored. test_data must be labeled and the shape of data and test_data must match.
- preprocess: bool, default = True
  When set to False, no transformations are applied except for `train_test_split` and custom transformations passed in `custom_pipeline` param. Data must be ready for modeling (no missing values, no dates, categorical data encoding) when `preprocess` is set to False.
- imputation_type: str, default = 'simple'
  The type of imputation to use. Can be either 'simple' or 'iterative'.
- iterative_imputation_iters: int, default = 5
  The number of iterations. Ignored when `imputation_type` is not 'iterative'.
- categorical_iterative_imputer: str, default = 'lightgbm'
  Estimator for iterative imputation of missing values in categorical features. Ignored when `imputation_type` is not 'iterative'.
- numeric_iterative_imputer: str, default = 'lightgbm'
  Estimator for iterative imputation of missing values in numeric features. Ignored when `imputation_type` is set to 'simple'.
- data_split_stratify: bool or list, default = False
  Controls stratification during 'train_test_split'. When set to True, will stratify by target column. To stratify on any other columns, pass a list of column names. Ignored when `data_split_shuffle` is False.
- fold_strategy: str or sklearn CV generator object, default = 'stratifiedkfold' / 'kfold'
  Choice of cross validation strategy. Possible values are:
  - 'kfold'
  - 'stratifiedkfold'
  - 'groupkfold'
  - 'timeseries'
  - a custom CV generator object compatible with scikit-learn.
- fold: int, default = 10
  The number of folds to be used in cross-validation. Must be at least 2. This is a global setting that can be overridden at the function level by using the `fold` parameter. Ignored when `fold_strategy` is a custom object.
- fold_shuffle: bool, default = False
  Controls the shuffle parameter of CV. Only applicable when `fold_strategy` is 'kfold' or 'stratifiedkfold'. Ignored when `fold_strategy` is a custom object.
- fold_groups: str or array-like, with shape (n_samples,), default = None
  Optional group labels when 'GroupKFold' is used for the cross-validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in the training dataset. When a string is passed, it is interpreted as the column name in the dataset containing group labels.
- use_gpu: str or bool, default = False
  When set to 'force', will try to use GPU with all algorithms that support it, and raise exceptions if they are unavailable. When set to True, will use GPU with algorithms that support it, and fall back to CPU if they are unavailable. When False, all algorithms are trained using CPU only.
- custom_pipeline: transformer or list of transformers or tuple, default = None
  When passed, will append the custom transformers in the preprocessing pipeline and are applied on each CV fold separately and on the final fit. All the custom transformations are applied after 'train_test_split' and before pycaret's internal transformations.
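As an illustration of what `data_split_stratify` is meant to achieve, the sketch below splits each class separately so that the train and hold-out sets keep the same class proportions. It is a standard-library approximation written for this note; `stratified_split` is a hypothetical helper, and PyCaret delegates the real split to `train_test_split`.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[Hashable], train_size: float = 0.7, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices), splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within the class
        cut = round(len(indices) * train_size)    # per-class train share
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels, train_size=0.7, seed=42)
train_positive_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_positive_rate = sum(labels[i] for i in test_idx) / len(test_idx)
```

Both rates match the 20% positive share of the full data, which an unstratified random split would only approximate.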
pycaret.classification pycaret.regression
Following new parameters have been added:
- cross_validation: bool = True
  When set to False, metrics are evaluated on holdout set. `fold` param is ignored when cross_validation is set to False.
- errors: str = "ignore"
  When set to 'ignore', will skip the model with exceptions and continue. If 'raise', will stop the function when exceptions are raised.
- fit_kwargs: Optional[dict] = None
  Dictionary of arguments passed to the fit method of the model.
- groups: Optional[Union[str, Any]] = None
  Optional group labels when 'GroupKFold' is used for the cross-validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in the training dataset. When a string is passed, it is interpreted as the column name in the dataset containing group labels.
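The `errors` parameter above describes two behaviors: skip a failing model and carry on, or stop at the first exception. A compact sketch of that control flow is given here; `fit_all` and the toy candidates are hypothetical and stand in for the real training loop, which this changelog does not show.

```python
from typing import Callable, Dict


def fit_all(candidates: Dict[str, Callable[[], float]], errors: str = "ignore") -> Dict[str, float]:
    """Evaluate each candidate; skip failures when errors='ignore', re-raise when 'raise'."""
    if errors not in ("ignore", "raise"):
        raise ValueError("errors must be 'ignore' or 'raise'")
    results = {}
    for name, train_and_score in candidates.items():
        try:
            results[name] = train_and_score()
        except Exception:
            if errors == "raise":
                raise
            # errors == "ignore": drop this candidate and continue with the rest
    return results


def broken_model() -> float:
    raise RuntimeError("training failed")


candidates = {"lr": lambda: 0.81, "broken": broken_model, "rf": lambda: 0.86}
scores = fit_all(candidates, errors="ignore")
```

With `errors="raise"` the same call would stop at the `broken` candidate and surface the `RuntimeError` to the caller.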
pycaret.classification pycaret.regression
Following new parameters have been added:
- cross_validation: bool = True
  When set to False, metrics are evaluated on holdout set. `fold` param is ignored when cross_validation is set to False.
- fit_kwargs: Optional[dict] = None
  Dictionary of arguments passed to the fit method of the model.
- groups: Optional[Union[str, Any]] = None
  Optional group labels when 'GroupKFold' is used for the cross-validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in the training dataset. When a string is passed, it is interpreted as the column name in the dataset containing group labels.

Following parameters have been removed:
- ensemble - Deprecated - use `ensemble_model` function directly.
- method - Deprecated - use `ensemble_model` function directly.
- system - Moved to private API.
pycaret.classification pycaret.regression
Following new parameters have been added:
- search_library: str, default = 'scikit-learn'
  The search library used for tuning hyperparameters. Possible values:
  - 'scikit-learn' - default, requires no further installation https://github.com/scikit-learn/scikit-learn
  - 'scikit-optimize' - `pip install scikit-optimize` https://scikit-optimize.github.io/stable/
  - 'tune-sklearn' - `pip install tune-sklearn ray[tune]` https://github.com/ray-project/tune-sklearn
  - 'optuna' - `pip install optuna` https://optuna.org/
- search_algorithm: str, default = None
  The search algorithm depends on the `search_library` parameter. Some search algorithms require additional libraries to be installed. When None, will use the search library-specific default algorithm.
  - `scikit-learn` possible values:
    - random (default)
    - grid
  - `scikit-optimize` possible values:
    - bayesian (default)
  - `tune-sklearn` possible values:
    - random (default)
    - grid
    - bayesian `pip install scikit-optimize`
    - hyperopt `pip install hyperopt`
    - bohb `pip install hpbandster ConfigSpace`
  - `optuna` possible values:
    - tpe (default)
    - random
- early_stopping: bool or str or object, default = False
  Use early stopping to stop fitting to a hyperparameter configuration if it performs poorly. Ignored when `search_library` is scikit-learn, or if the estimator does not have 'partial_fit' attribute. Can be either an object accepted by the search library or one of the following:
  - 'asha' for Asynchronous Successive Halving Algorithm
  - 'hyperband' for Hyperband
  - 'median' for Median Stopping Rule
  - If False or None, early stopping will not be used.
- early_stopping_max_iters: int, default = 10
  The maximum number of epochs to run for each sampled configuration. Ignored if `early_stopping` is False or None.
- fit_kwargs: Optional[dict] = None
  Dictionary of arguments passed to the fit method of the model.
- groups: Optional[Union[str, Any]] = None
  Optional group labels when 'GroupKFold' is used for the cross-validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in the training dataset. When a string is passed, it is interpreted as the column name in the dataset containing group labels.
- return_tuner: bool, default = False
  When set to True, will return a tuple of (model, tuner_object).
- tuner_verbose: bool or int, default = True
  If True or above 0, will print messages from the tuner. Higher values print more messages. Ignored when `verbose` param is False.
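The relationship between `search_library` and `search_algorithm` listed above amounts to a lookup table with a per-library default. The sketch below encodes exactly the values from that list; the function name `resolve_search_algorithm` and the error messages are hypothetical, and PyCaret's own validation may be structured differently.

```python
from typing import Optional

# Allowed algorithms per library, as listed above; the first entry is the default.
SEARCH_ALGORITHMS = {
    "scikit-learn": ["random", "grid"],
    "scikit-optimize": ["bayesian"],
    "tune-sklearn": ["random", "grid", "bayesian", "hyperopt", "bohb"],
    "optuna": ["tpe", "random"],
}


def resolve_search_algorithm(
    search_library: str = "scikit-learn", search_algorithm: Optional[str] = None
) -> str:
    """Return the algorithm to use, falling back to the library-specific default."""
    if search_library not in SEARCH_ALGORITHMS:
        raise ValueError(f"unknown search_library: {search_library!r}")
    allowed = SEARCH_ALGORITHMS[search_library]
    if search_algorithm is None:
        return allowed[0]
    if search_algorithm not in allowed:
        raise ValueError(
            f"{search_algorithm!r} is not available for {search_library!r}; choose from {allowed}"
        )
    return search_algorithm


default_for_optuna = resolve_search_algorithm("optuna")
explicit_choice = resolve_search_algorithm("tune-sklearn", "bohb")
```

Keeping the table in one place makes it easy to see, for instance, that `bayesian` is valid for `tune-sklearn` and `scikit-optimize` but not for `optuna`.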
pycaret.classification pycaret.regression
Following new parameters have been added:
- fit_kwargs: Optional[dict] = None
  Dictionary of arguments passed to the fit method of the model.
- groups: Optional[Union[str, Any]] = None
  Optional group labels when 'GroupKFold' is used for the cross-validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in the training dataset. When a string is passed, it is interpreted as the column name in the dataset containing group labels.
pycaret.classification pycaret.regression
Following new parameters have been added:
- fit_kwargs: Optional[dict] = None
  Dictionary of arguments passed to the fit method of the model.
- groups: Optional[Union[str, Any]] = None
  Optional group labels when 'GroupKFold' is used for the cross-validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in the training dataset. When a string is passed, it is interpreted as the column name in the dataset containing group labels.
- weights: list, default = None
  Sequence of weights (float or int) to weight the occurrences of predicted class labels (hard voting) or class probabilities before averaging (soft voting). Uses uniform weights when None.
- The default value for the `method` parameter has been changed from `hard` to `auto`.
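The `weights` description above distinguishes hard voting (weighted counts of predicted labels) from soft voting (weighted average of class probabilities). The following sketch shows both on toy inputs. `hard_vote` and `soft_vote` are hypothetical helpers for illustration only; in practice the blending is handled by scikit-learn style voting estimators rather than code like this.

```python
from collections import defaultdict
from typing import Dict, Optional, Sequence


def hard_vote(predictions: Sequence[str], weights: Optional[Sequence[float]] = None) -> str:
    """Weighted majority vote over predicted class labels."""
    weights = weights or [1.0] * len(predictions)
    tally: Dict[str, float] = defaultdict(float)
    for label, weight in zip(predictions, weights):
        tally[label] += weight
    return max(tally, key=tally.get)


def soft_vote(probabilities: Sequence[Dict[str, float]], weights: Optional[Sequence[float]] = None) -> str:
    """Weighted average of class probabilities, then pick the most probable class."""
    weights = weights or [1.0] * len(probabilities)
    total = sum(weights)
    averaged: Dict[str, float] = defaultdict(float)
    for probs, weight in zip(probabilities, weights):
        for label, p in probs.items():
            averaged[label] += weight * p / total
    return max(averaged, key=averaged.get)


# Three models: one very confident "cat", two mildly confident "dog".
probabilities = [
    {"cat": 0.9, "dog": 0.1},
    {"cat": 0.4, "dog": 0.6},
    {"cat": 0.4, "dog": 0.6},
]
predictions = [max(p, key=p.get) for p in probabilities]

hard_uniform = hard_vote(predictions)                  # counts labels only
soft_uniform = soft_vote(probabilities)                # averages probabilities
hard_weighted = hard_vote(predictions, [3, 1, 1])      # first model counts triple
soft_weighted = soft_vote(probabilities, [1, 3, 3])    # last two models count triple
```

With uniform weights the two schemes disagree here: hard voting follows the two-model majority, while soft voting is swayed by the single confident model.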
pycaret.classification pycaret.regression
Following new parameters have been added:
- fit_kwargs: Optional[dict] = None
  Dictionary of arguments passed to the fit method of the model.
- groups: Optional[Union[str, Any]] = None
  Optional group labels when 'GroupKFold' is used for the cross-validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in the training dataset. When a string is passed, it is interpreted as the column name in the dataset containing group labels.
pycaret.classification
Following new parameters have been added:
- fit_kwargs: Optional[dict] = None
  Dictionary of arguments passed to the fit method of the model.
- groups: Optional[Union[str, Any]] = None
  Optional group labels when 'GroupKFold' is used for the cross-validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in the training dataset. When a string is passed, it is interpreted as the column name in the dataset containing group labels.
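The `groups` parameter recurs across these functions and only matters for group-wise cross-validation, where all rows sharing a group label must land on the same side of each split. A simplified sketch of such a splitter follows. `group_k_fold` is a hypothetical helper, and its greedy fold balancing is an assumption made for this example rather than a description of scikit-learn's `GroupKFold`.

```python
from collections import defaultdict
from typing import Hashable, Iterator, List, Sequence, Tuple


def group_k_fold(
    groups: Sequence[Hashable], n_splits: int = 3
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) so that no group spans both sides."""
    members = defaultdict(list)
    for index, group in enumerate(groups):
        members[group].append(index)
    if n_splits > len(members):
        raise ValueError("n_splits cannot exceed the number of distinct groups")

    # Greedy balancing: place the largest groups first into the currently smallest fold.
    folds: List[List[int]] = [[] for _ in range(n_splits)]
    for group in sorted(members, key=lambda g: len(members[g]), reverse=True):
        min(folds, key=len).extend(members[group])

    for test in folds:
        held_out = set(test)
        train = [i for i in range(len(groups)) if i not in held_out]
        yield train, sorted(test)


# One group label per row, as with an array of shape (n_samples,).
groups = ["a", "a", "b", "b", "b", "c", "d", "d"]
splits = list(group_k_fold(groups, n_splits=2))
```

Each row appears in exactly one test fold, and a group such as `"b"` is never split between training and testing, which is the property that prevents leakage across related rows.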
pycaret.classification pycaret.regression
Following new parameters have been added:
- fold: int or scikit-learn compatible CV generator, default = None
  Controls cross-validation. If None, the CV generator in the `fold_strategy` parameter of the `setup` function is used. When an integer is passed, it is interpreted as the 'n_splits' parameter of the CV generator in the `setup` function.
- fit_kwargs: Optional[dict] = None
  Dictionary of arguments passed to the fit method of the model.
- groups: Optional[Union[str, Any]] = None
  Optional group labels when 'GroupKFold' is used for the cross-validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in the training dataset. When a string is passed, it is interpreted as the column name in the dataset containing group labels.
pycaret.classification pycaret.regression
Following new parameters have been added:
- fold: int or scikit-learn compatible CV generator, default = None
  Controls cross-validation. If None, the CV generator in the `fold_strategy` parameter of the `setup` function is used. When an integer is passed, it is interpreted as the 'n_splits' parameter of the CV generator in the `setup` function.
- fit_kwargs: Optional[dict] = None
  Dictionary of arguments passed to the fit method of the model.
- groups: Optional[Union[str, Any]] = None
  Optional group labels when 'GroupKFold' is used for the cross-validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in the training dataset. When a string is passed, it is interpreted as the column name in the dataset containing group labels.
pycaret.classification pycaret.regression
The following new parameters have been added:
- `fit_kwargs`: Optional[dict] = None. Dictionary of arguments passed to the fit method of the model.
- `groups`: Optional[Union[str, Any]] = None. Optional group labels when 'GroupKFold' is used for the cross-validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in the training dataset. When a string is passed, it is interpreted as the column name in the dataset containing group labels.
- `model_only`: bool, default = True. When set to True, only the model object is re-trained and all the transformations in the Pipeline are ignored.
pycaret.classification pycaret.regression pycaret.clustering pycaret.anomaly
The following new parameters have been added:
- `internal`: bool, default = False. When True, will return extra columns and rows used internally.
- `raise_errors`: bool, default = True. When False, will suppress all exceptions, ignoring models that couldn't be created.
- Post-release `2.1` a bug has been reported preventing the `predict_model` function from working in the `regression` module in a new notebook session when `transform_target` was set to `False` during model training. This issue has been fixed in PyCaret release `2.1.2`. To learn more about the issue: pycaret#525
- Post-release `2.1` a bug has been identified in the MLFlow back-end. The error only occurs when `log_experiment` in the `setup` function is set to True, and it applies to all modules. The cause of the error has been identified and an issue has been opened with `MLFlow`. The error is caused by the `infer_signature` function in `mlflow.sklearn.log_model` and is only raised when there are missing values in the dataset. This issue has been fixed in PyCaret release `2.1.1` by skipping the signature in cases where `MLFlow` raises an exception.
- **Model Deployment** Model deployment support for `gcp` and `azure` has been added in the `deploy_model` function for all modules. See `documentation` for details.
- **Compare Models Budget Time** New parameter `budget_time` added in the `compare_models` function. Use the `budget_time` parameter to set an upper limit on `compare_models` training time.
- **Feature Selection** New feature selection method `boruta` has been added. By default, the `feature_selection_method` parameter in the `setup` function is set to `classic` but can be set to `boruta` for feature selection using the boruta algorithm. This change is applicable for `pycaret.classification` and `pycaret.regression`.
- **Numeric Imputation** New method `zero` has been added in the `numeric_imputation` parameter of the `setup` function. When the method is set to `zero`, missing values are replaced with the constant 0. The default behavior of `numeric_imputation` is unchanged.
- **Plot Model** New parameter `scale` has been added in `plot_model` for all modules to enable high quality images for research publications.
- **User Defined Loss Function** You can now pass `custom_scorer` to optimize a user defined loss function in `tune_model` for `pycaret.classification` and `pycaret.regression`. You must use `make_scorer` from `sklearn` to create the custom loss function that is passed into `custom_scorer` for the `tune_model` function.
- **Change in Pipeline Behavior** When using `save_model`, the `model` object is appended into the `Pipeline`; as such, the behavior of `Pipeline` and `predict_model` is now changed. Instead of saving a `list`, `save_model` now saves a `Pipeline` object where the trained model is in the last position. The user functionality on the front-end for `predict_model` remains the same.
- **Compare Models** Parameters `blacklist` and `whitelist` are now renamed to `exclude` and `include` with no change in functionality.
- **Predict Model Labels** The `Label` column returned by the `predict_model` function in `pycaret.classification` now returns the original label instead of the encoded value. This change makes the output from `predict_model` more human-readable. A new parameter `encoded_labels` is added, which is `False` by default. When set to `True`, it will return encoded labels.
- **Model Logging** Model persistence in the backend when `log_experiment` is set to `True` is now changed. Instead of using the internal `save_model` functionality, it now uses `mlflow.sklearn.save_model` to allow the use of Model Registry and `MLFlow` native deployment functionalities.
- **CatBoost Compatibility** `CatBoostClassifier` is now compatible with `blend_models` in `pycaret.classification`. As such, `blend_models` without any `estimator_list` will now result in blending a total of `15` estimators including `CatBoostClassifier`.
- **Stack Models** `stack_models` in `pycaret.classification` and `pycaret.regression` now adopts `StackingClassifier()` and `StackingRegressor` from `sklearn`. As such, the `stack_models` function now returns an `sklearn` object instead of the custom `list` returned in previous versions.
- **Create Stacknet** `create_stacknet` in `pycaret.classification` and `pycaret.regression` is now removed.
- **Tune Model** `tune_model` in `pycaret.classification` and `pycaret.regression` now inherits params from the input `estimator`. As such, if you have trained `xgboost`, `lightgbm` or `catboost` on GPU, the tuned model will now inherit the training method from the `estimator`.
- **Interpret Model** `**kwargs` argument now added in `interpret_model`.
- **Pandas Categorical Type** All modules are now compatible with the `pandas.Categorical` object. Internally such columns are converted into object and are treated the same way as `object` or `bool` columns.
- **use_gpu** A new parameter added in the `setup` function for `pycaret.classification` and `pycaret.regression`. In `2.1` it was added to prepare for the backend work required to make this change in future releases. As such, using the `use_gpu` param in `2.1` has no impact.
- **Unit Tests** Unit testing enhanced. Continuous improvement in progress https://github.com/pycaret/pycaret/tree/master/pycaret/tests
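The idea behind a training-time budget such as `budget_time` can be sketched without PyCaret. The function `compare_within_budget`, its `budget_seconds` argument, and the injectable `clock` are all hypothetical names chosen for illustration; the unit and the exact stopping rule used by `compare_models` are not stated in this changelog, so treat this as one plausible reading rather than the library's behavior.

```python
import time
from typing import Callable, Dict, List, Tuple


def compare_within_budget(
    trainers: Dict[str, Callable[[], float]],
    budget_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> List[Tuple[str, float]]:
    """Train candidates in order until the time budget is used up.

    Each trainer is a zero-argument callable returning a score
    (assumed here to be higher-is-better).
    """
    start = clock()
    results: List[Tuple[str, float]] = []
    for name, train in trainers.items():
        # Assumption: the budget is checked before each candidate starts,
        # so a candidate that is already running is never interrupted.
        if clock() - start >= budget_seconds:
            break
        results.append((name, train()))
    # Rank whatever finished in time, best score first.
    return sorted(results, key=lambda item: item[1], reverse=True)


# A fake clock that advances one second per call keeps the example deterministic.
ticks = iter([0.0, 1.0, 2.0, 3.0, 4.0])
candidates = {"lr": lambda: 0.70, "rf": lambda: 0.85, "xgboost": lambda: 0.90}
ranking = compare_within_budget(
    candidates, budget_seconds=2.5, clock=lambda: next(ticks)
)
```

With the fake clock, the third candidate is skipped because the budget is exhausted before it would start, so only the first two models are ranked.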
- **Automated Documentation Added** Automated documentation now added. Documentation on Website will only update for `major` releases 0.X. For all minor monthly releases, documentation will be available on: https://pycaret.readthedocs.io/en/latest/
- **Introduction of GitHub Actions** CI/CD build testing is now moved from `travis-ci` to `github-actions`. `pycaret-nightly` is now being published every 24 hours automatically.
- **Tutorials** All tutorials are now updated using `pycaret==2.0`. https://github.com/pycaret/pycaret/tree/master/tutorials
- **Resources** New resources added under `/pycaret/resources/` https://github.com/pycaret/pycaret/tree/master/resources
- **Example Notebook** Many example notebooks added under `/pycaret/examples/` https://github.com/pycaret/pycaret/tree/master/examples
- **Experiment Logging** MLFlow logging backend added. New parameters `log_experiment` `experiment_name` `log_profile` `log_data` added in `setup`. Available in `pycaret.classification` `pycaret.regression` `pycaret.clustering` `pycaret.anomaly` `pycaret.nlp`
- **Save / Load Experiment** The `save_experiment` and `load_experiment` functions from `pycaret.classification` `pycaret.regression` `pycaret.clustering` `pycaret.anomaly` `pycaret.nlp` are removed in PyCaret 2.0
- **System Logging** System log files are now generated when `setup` is executed. The `logs.log` file is saved in the current working directory. The function `get_system_logs` can be used to access the log file in a notebook.
- **Command Line Support** When using PyCaret 2.0 outside of a Notebook, the `html` parameter in `setup` must be set to False.
- **Imbalance Dataset** `fix_imbalance` and `fix_imbalance_method` parameters added in `setup` for `pycaret.classification`. When set to True, SMOTE is applied by default to create synthetic datapoints for the minority class. To change the method, pass any class from `imblearn` that supports the `fit_resample` method in the `fix_imbalance_method` parameter.
- **Save Plot** `save` parameter added in `plot_model`. When set to True, it saves the plot as `png` or `html` in the current working directory.
- **kwargs** `**kwargs` added in `create_model` for `pycaret.classification` `pycaret.regression` `pycaret.clustering` `pycaret.anomaly`
- **choose_better** `choose_better` and `optimize` parameters added in `tune_model` `ensemble_model` `blend_models` `stack_models` `create_stacknet` in `pycaret.classification` and `pycaret.regression`. Read the details below to learn more about these parameters.
- **Training Time** `TT (Sec)` added in the `compare_models` function for `pycaret.classification` and `pycaret.regression`
- **New Metric: MCC** `MCC` metric added in the score grid for `pycaret.classification`
- **NEW FUNCTION: automl()** New function `automl` added in `pycaret.classification` `pycaret.regression`
- **NEW FUNCTION: pull()** New function `pull` added in `pycaret.classification` `pycaret.regression`
- **NEW FUNCTION: models()** New function `models` added in `pycaret.classification` `pycaret.regression` `pycaret.clustering` `pycaret.anomaly` `pycaret.nlp`
- **NEW FUNCTION: get_logs()** New function `get_logs` added in `pycaret.classification` `pycaret.regression` `pycaret.clustering` `pycaret.anomaly` `pycaret.nlp`
- **NEW FUNCTION: get_config()** New function `get_config` added in `pycaret.classification` `pycaret.regression` `pycaret.clustering` `pycaret.anomaly` `pycaret.nlp`
- **NEW FUNCTION: set_config()** New function `set_config` added in `pycaret.classification` `pycaret.regression` `pycaret.clustering` `pycaret.anomaly` `pycaret.nlp`
- **NEW FUNCTION: get_system_logs** New function `get_system_logs` added in `pycaret.classification` `pycaret.regression` `pycaret.clustering` `pycaret.anomaly` `pycaret.nlp`
- **CHANGE IN BEHAVIOR: compare_models** `compare_models` now returns the top_n models defined by the `n_select` parameter, by default set to 1.
- **CHANGE IN BEHAVIOR: tune_model** The `tune_model` function in `pycaret.classification` and `pycaret.regression` now requires a trained model object to be passed as `estimator` instead of a string abbreviation / ID.
- **REMOVED DEPENDENCIES** `awscli` and `shap` removed from requirements.txt. To use the `interpret_model` function in `pycaret.classification` `pycaret.regression` and the `deploy_model` function in `pycaret.classification` `pycaret.regression` `pycaret.clustering` `pycaret.anomaly`, these libraries will have to be installed separately.
pycaret.classification pycaret.regression pycaret.clustering pycaret.anomaly pycaret.nlp
- `remove_perfect_collinearity` parameter added in `setup()`. Default set to False.
  When set to True, perfect collinearity (features with correlation = 1) is removed from the dataset. When two features are 100% correlated, one of them is randomly dropped from the dataset.
- `fix_imbalance` parameter added in `setup()`. Default set to False.
  When the dataset has an unequal distribution of the target class, it can be fixed using the fix_imbalance parameter. When set to True, SMOTE (Synthetic Minority Over-sampling Technique) is applied by default to create synthetic datapoints for the minority class.
- `fix_imbalance_method` parameter added in `setup()`. Default set to None.
  When fix_imbalance is set to True and fix_imbalance_method is None, 'smote' is applied by default to oversample the minority class during cross validation. This parameter accepts any module from 'imblearn' that supports the 'fit_resample' method.
- `data_split_shuffle` parameter added in `setup()`. Default set to True.
  If set to False, prevents shuffling of rows when splitting data.
- `folds_shuffle` parameter added in `setup()`. Default set to False.
  If set to False, prevents shuffling of rows when using cross validation.
- `n_jobs` parameter added in `setup()`. Default set to -1.
  The number of jobs to run in parallel (for functions that support parallel processing). -1 means using all processors. To run all functions on a single processor, set n_jobs to None.
- `html` parameter added in `setup()`. Default set to True.
  If set to False, prevents runtime display of the monitor. This must be set to False when using an environment that doesn't support HTML.
- `log_experiment` parameter added in `setup()`. Default set to False.
  When set to True, all metrics and parameters are logged on the MLFlow server.
- `experiment_name` parameter added in `setup()`. Default set to None.
  Name of experiment for logging. When set to None, 'clf' is by default used as alias for the experiment name.
- `log_plots` parameter added in `setup()`. Default set to False.
  When set to True, specific plots are logged in MLflow as a png file.
- `log_profile` parameter added in `setup()`. Default set to False.
  When set to True, the data profile is also logged on MLflow as an html file.
- `log_data` parameter added in `setup()`. Default set to False.
  When set to True, the train and test datasets are logged as csv.
- `verbose` parameter added in `setup()`. Default set to True.
  The information grid is not printed when verbose is set to False.
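To make the `remove_perfect_collinearity` description concrete, here is a small standard-library sketch of dropping one feature from each perfectly correlated pair. The helpers `pearson` and `drop_perfectly_collinear`, the `seed` argument, and the tolerance are assumptions made for illustration and do not reflect how `setup()` implements the step internally.

```python
import math
import random
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equally long columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0 or std_y == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / (std_x * std_y)


def drop_perfectly_collinear(
    columns: Dict[str, List[float]], seed: int = 0, tol: float = 1e-9
) -> Dict[str, List[float]]:
    """Randomly drop one feature from every pair with correlation = 1."""
    rng = random.Random(seed)
    kept = dict(columns)
    names = list(columns)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            # Skip pairs where one member was already dropped.
            if first not in kept or second not in kept:
                continue
            if abs(pearson(kept[first], kept[second]) - 1.0) <= tol:
                del kept[rng.choice([first, second])]
    return kept


features = {
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "height_mm": [1500.0, 1600.0, 1700.0, 1800.0],  # exact multiple of height_cm
    "age": [40.0, 10.0, 30.0, 20.0],
}
reduced = drop_perfectly_collinear(features)
```

After the call, `age` is kept and exactly one of the two height columns remains; which one depends on the random seed.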
pycaret.classification pycaret.regression
- `whitelist` parameter added in `compare_models`. Default set to None.
  In order to run only certain models for the comparison, the model IDs can be passed as a list of strings in the whitelist param.
- `n_select` parameter added in `compare_models`. Default set to 1.
  Number of top_n models to return. Use a negative argument for bottom selection. For example, n_select = -3 means the bottom 3 models.
- `verbose` parameter added in `compare_models`. Default set to True.
  The score grid is not printed when verbose is set to False.
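The top/bottom semantics of `n_select` map naturally onto list slicing. The sketch below assumes the candidates are already ranked best-first; `select_models` is a hypothetical helper, and the exact return type of `compare_models` (for example a single model versus a list) is not specified here.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def select_models(ranked: Sequence[T], n_select: int = 1) -> List[T]:
    """Pick models from a best-first ranking.

    A positive n_select takes the top n models; a negative n_select
    takes the bottom |n| models.
    """
    if n_select == 0:
        raise ValueError("n_select must be a non-zero integer.")
    if n_select > 0:
        return list(ranked[:n_select])
    return list(ranked[n_select:])


# Hypothetical model IDs, ordered from best to worst.
ranking = ["catboost", "lightgbm", "rf", "lr", "knn"]
top_one = select_models(ranking)          # default n_select = 1
top_three = select_models(ranking, 3)
bottom_three = select_models(ranking, -3)
```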
pycaret.classification pycaret.regression pycaret.clustering pycaret.anomaly
- `cross_validation` parameter added in `create_model`. Default set to True.
  When cross_validation is set to False, the fold parameter is ignored and the model is trained on the entire training dataset. No metric evaluation is returned. Only applicable in `pycaret.classification` and `pycaret.regression`.
- `system` parameter added in `create_model`. Default set to True.
  Must remain True at all times. Only to be changed by internal functions.
- `ground_truth` parameter added in `create_model`. Default set to None.
  When ground_truth is provided, Homogeneity Score, Rand Index, and Completeness Score are evaluated and printed along with other metrics. This is only available in `pycaret.clustering`.
- `kwargs` parameter added in `create_model`.
  Additional keyword arguments to pass to the estimator.
pycaret.classification pycaret.regression pycaret.clustering pycaret.anomaly pycaret.nlp
- `custom_grid` parameter added in `tune_model`. Default set to None.
  To use custom hyperparameters for tuning, pass a dictionary with parameter names and the values to be iterated. When set to None, it uses the pre-defined tuning grid. For `pycaret.clustering` `pycaret.anomaly` `pycaret.nlp`, the custom_grid param must be a list of values to iterate over.
- `choose_better` parameter added in `tune_model`. Default set to False.
  When set to True, the base estimator is returned when the performance doesn't improve by tune_model. This guarantees the returned object would perform at least equivalent to the base estimator created using create_model or the model returned by compare_models.
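The `choose_better` rule reduces to a single comparison. The sketch below uses a hypothetical `pick_better` helper and assumes a higher-is-better metric such as Accuracy; how PyCaret handles error metrics where lower is better is not covered by this changelog.

```python
from typing import Any


def pick_better(
    base_model: Any,
    base_score: float,
    tuned_model: Any,
    tuned_score: float,
    choose_better: bool = False,
) -> Any:
    """Return the tuned model unless choose_better says the base one is at least as good.

    Assumption: scores are higher-is-better (e.g. Accuracy or R2).
    """
    if choose_better and tuned_score <= base_score:
        # Performance did not improve, so fall back to the base estimator.
        return base_model
    return tuned_model


# Strings stand in for fitted estimator objects.
kept_base = pick_better("base_rf", 0.91, "tuned_rf", 0.89, choose_better=True)
kept_tuned = pick_better("base_rf", 0.91, "tuned_rf", 0.93, choose_better=True)
default_behavior = pick_better("base_rf", 0.91, "tuned_rf", 0.89)
```

With `choose_better` left at its default of False, the tuned model is returned even when it scores worse, which matches the default described above.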
pycaret.classification pycaret.regression
- `choose_better` parameter added in `ensemble_model`. Default set to False.
  When set to True, the base estimator is returned when the performance doesn't improve by ensemble_model. This guarantees the returned object would perform at least equivalent to the base estimator created using create_model or the model returned by compare_models.
- `optimize` parameter added in `ensemble_model`. Default set to `Accuracy` for `pycaret.classification` and `R2` for `pycaret.regression`.
  Only used when choose_better is set to True. The optimize parameter is used to compare the ensembled model with the base estimator. Values accepted in the optimize parameter for `pycaret.classification` are 'Accuracy', 'AUC', 'Recall', 'Precision', 'F1', 'Kappa', 'MCC' and for `pycaret.regression` are 'MAE', 'MSE', 'RMSE', 'R2', 'RMSLE' and 'MAPE'.
pycaret.classification pycaret.regression
- `choose_better` parameter added in `blend_models`. Default set to False.
  When set to True, the base estimator is returned when the performance doesn't improve by blend_models. This guarantees the returned object would perform at least equivalent to the base estimator created using create_model or the model returned by compare_models.
- `optimize` parameter added in `blend_models`. Default set to `Accuracy` for `pycaret.classification` and `R2` for `pycaret.regression`.
  Only used when choose_better is set to True. The optimize parameter is used to compare the ensembled model with the base estimator. Values accepted in the optimize parameter for `pycaret.classification` are 'Accuracy', 'AUC', 'Recall', 'Precision', 'F1', 'Kappa', 'MCC' and for `pycaret.regression` are 'MAE', 'MSE', 'RMSE', 'R2', 'RMSLE' and 'MAPE'.
pycaret.classification pycaret.regression
- `choose_better` parameter added in `stack_models`. Default set to False.
  When set to True, the base estimator is returned when the performance doesn't improve by stack_models. This guarantees the returned object would perform at least equivalent to the base estimator created using create_model or the model returned by compare_models.
- `optimize` parameter added in `stack_models`. Default set to `Accuracy` for `pycaret.classification` and `R2` for `pycaret.regression`.
  Only used when choose_better is set to True. The optimize parameter is used to compare the ensembled model with the base estimator. Values accepted in the optimize parameter for `pycaret.classification` are 'Accuracy', 'AUC', 'Recall', 'Precision', 'F1', 'Kappa', 'MCC' and for `pycaret.regression` are 'MAE', 'MSE', 'RMSE', 'R2', 'RMSLE' and 'MAPE'.
pycaret.classification pycaret.regression
- `choose_better` parameter added in `create_stacknet`. Default set to False.
  When set to True, the base estimator is returned when the performance doesn't improve by create_stacknet. This guarantees the returned object would perform at least equivalent to the base estimator created using create_model or the model returned by compare_models.
- `optimize` parameter added in `create_stacknet`. Default set to `Accuracy` for `pycaret.classification` and `R2` for `pycaret.regression`.
  Only used when choose_better is set to True. The optimize parameter is used to compare the ensembled model with the base estimator. Values accepted in the optimize parameter for `pycaret.classification` are 'Accuracy', 'AUC', 'Recall', 'Precision', 'F1', 'Kappa', 'MCC' and for `pycaret.regression` are 'MAE', 'MSE', 'RMSE', 'R2', 'RMSLE' and 'MAPE'.
pycaret.classification pycaret.regression
- `verbose` parameter added in `predict_model`. Default set to True.
  The holdout score grid is not printed when verbose is set to False.
pycaret.classification pycaret.regression pycaret.clustering pycaret.anomaly pycaret.nlp
- `save` parameter added in `plot_model`. Default set to False.
  When set to True, the plot is saved as a 'png' file in the current working directory.
- `verbose` parameter added in `plot_model`. Default set to True.
  The progress bar is not shown when verbose is set to False.
- `system` parameter added in `plot_model`. Default set to True.
  Must remain True at all times. Only to be changed by internal functions.
pycaret.classification pycaret.regression
- This function returns the best model out of all models created in the current active environment based on the metric defined in the optimize parameter.
- `optimize`: string, default = 'Accuracy' for `pycaret.classification` and 'R2' for `pycaret.regression`
  Other values you can pass in the optimize param are 'AUC', 'Recall', 'Precision', 'F1', 'Kappa', and 'MCC' for `pycaret.classification` and 'MAE', 'MSE', 'RMSE', 'R2', 'RMSLE', and 'MAPE' for `pycaret.regression`
- `use_holdout`: bool, default = False
  When set to True, metrics are evaluated on the holdout set instead of CV.
pycaret.classification pycaret.regression
- This function returns the last printed score grid as pandas dataframe.
pycaret.classification pycaret.regression pycaret.clustering pycaret.anomaly pycaret.nlp
- This function returns the table of models available in the model library.
- `type`: string, default = None
  - linear: filters and only returns linear models
  - tree: filters and only returns tree based models
  - ensemble: filters and only returns ensemble models

  The `type` parameter is only available in `pycaret.classification` and `pycaret.regression`
pycaret.classification pycaret.regression pycaret.clustering pycaret.anomaly pycaret.nlp
- This function returns a table with experiment logs consisting of run details, parameters, metrics and tags.
- `experiment_name`: string, default = None
  When set to None, the current active run is used.
- `save`: bool, default = False
  When set to True, a csv file is saved in the current directory.
pycaret.classification pycaret.regression pycaret.clustering pycaret.anomaly pycaret.nlp
- This function is used to access global environment variables. Check the docstring for the list of accessible global variables.
pycaret.classification pycaret.regression pycaret.clustering pycaret.anomaly pycaret.nlp
- This function is used to reset global environment variables. Check the docstring for the list of accessible global variables.
pycaret.classification pycaret.regression pycaret.clustering pycaret.anomaly pycaret.nlp
- This function reads and prints the 'logs.log' file from the current active directory. logs.log is generated when `setup` is initialized in any module.