docs for custom eval metric (#470)

pplonski · pplonski · commit 4131eaa360d1 · 2026-05-29T14:06:51.000+02:00
diff --git a/README.md b/README.md
@@ -214,6 +214,8 @@ All models are automatically saved to be able to restore the training after inte
 - for multiclass classification: `logloss`, `f1`, `accuracy` - default is `logloss`
 - for regression: `rmse`, `mse`, `mae`, `r2`, `mape`, `spearman`, `pearson` - default is `rmse`
 
+You can also pass a custom Python function directly as `eval_metric`. See the docs for [Custom eval metric](https://supervised.mljar.com/features/custom-eval-metric/).
+
 If you don't find the `eval_metric` that you need, please add a new issue. We will add it.
 
 
diff --git a/docs/docs/api.md b/docs/docs/api.md
@@ -7,6 +7,7 @@ social:
 # API documentation
 
 If you are looking for how trained models are stored and reloaded, see [Save and Load models](features/save-and-load-models.md).
+If you need a user-defined evaluation function, see [Custom eval metric](features/custom-eval-metric.md).
 
 ## `AutoML` class
 
diff --git a/docs/docs/features/custom-eval-metric.md b/docs/docs/features/custom-eval-metric.md
@@ -0,0 +1,133 @@
+---
+description: How to use a custom evaluation metric in MLJAR AutoML by passing a Python function directly as eval_metric.
+social:
+  cards_layout: default/variant
+---
+
+# Custom eval metric
+
+`mljar-supervised` supports custom evaluation metrics.
+
+You can pass your own Python function directly as the `eval_metric` argument in `AutoML`.
+
+## Basic usage
+
+The function should have this interface:
+
+```python
+def my_custom_metric(y_true, y_predicted, sample_weight=None):
+    # compute score
+    return score
+```
+
+Then use it directly:
+
+```python
+from supervised import AutoML
+
+automl = AutoML(
+    results_path="AutoML_custom_metric",
+    eval_metric=my_custom_metric,
+)
+automl.fit(X, y)
+```
+
+## Important rule: the metric must be minimized
+
+Custom metrics in `mljar-supervised` are always treated as metrics to minimize.
+
+This means:
+
+- if lower is better, return the value directly
+- if higher is better, return its negative value
+
+For example:
+
+- MSE can be returned directly
+- precision, F1, or AUC are higher-is-better, so the function should return `-value`
+
+## Regression example
+
+```python
+import numpy as np
+from supervised import AutoML
+
+def custom_mse(y_true, y_predicted, sample_weight=None):
+    y_true = np.asarray(y_true).ravel()
+    y_predicted = np.asarray(y_predicted).ravel()
+    return np.average((y_true - y_predicted) ** 2, weights=sample_weight)  # plain mean when weights are None
+
+automl = AutoML(
+    results_path="AutoML_regression_custom_metric",
+    eval_metric=custom_mse,
+)
+automl.fit(X, y)
+```
+
+## Classification example
+
+For classification, `y_predicted` can contain probabilities, so you may need to apply thresholding or `argmax` inside your metric.
+
+```python
+import numpy as np
+from sklearn.metrics import precision_score
+from supervised import AutoML
+
+def positive_class_precision(y_true, y_predicted, sample_weight=None):
+    y_true = np.asarray(y_true)
+    y_predicted = np.asarray(y_predicted)
+
+    if y_predicted.ndim == 2 and y_predicted.shape[1] == 1:
+        y_predicted = y_predicted.ravel()
+
+    if y_predicted.ndim == 1:
+        y_predicted = (y_predicted > 0.5).astype(int)
+    else:
+        y_predicted = np.argmax(y_predicted, axis=1)
+
+    value = precision_score(y_true, y_predicted, sample_weight=sample_weight)
+
+    # higher precision is better, so return negative value
+    return -value
+
+automl = AutoML(
+    results_path="AutoML_classification_custom_metric",
+    eval_metric=positive_class_precision,
+)
+automl.fit(X, y)
+```
+
+## Notes
+
+- the metric function must return a single numeric value
+- the metric should handle `sample_weight=None`
+- the metric will be used for early stopping and model selection
+- the metric should be deterministic and reasonably fast
+
+## FAQ
+
+### Can I pass a function directly?
+
+Yes. This is the supported public interface:
+
+```python
+automl = AutoML(eval_metric=my_custom_metric)
+```
+
+### Should I pass `eval_metric="user_defined_metric"`?
+
+No. That name is used internally. In user code, pass the function itself.
+
+### Can I maximize my metric directly?
+
+No. Convert it to a minimization target, usually by returning `-value`.
+
+### Why do I need thresholding for some classification metrics?
+
+Many classification metrics, such as precision or F1, expect class labels, while the predictions passed to the metric during evaluation can be probabilities.
+
+## Related pages
+
+- [AutoML API](../api.md)
+- [Save and Load models](save-and-load-models.md)
+- [Preprocessing](preprocessing.md)
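Reviewer note, not part of the commit: the minimization rule and the thresholding advice in the new page can be combined in one small metric. The sketch below is only an illustration under stated assumptions. `to_labels` and `negative_accuracy` are hypothetical names, the 0.5 threshold is an assumption, and the code uses NumPy only, so it makes no claim about how `mljar-supervised` shapes `y_predicted` internally.

```python
import numpy as np


def to_labels(y_predicted, threshold=0.5):
    """Hypothetical helper: turn probabilities into class labels."""
    y_predicted = np.asarray(y_predicted)
    # A single probability column is flattened to 1-D first.
    if y_predicted.ndim == 2 and y_predicted.shape[1] == 1:
        y_predicted = y_predicted.ravel()
    if y_predicted.ndim == 1:
        # Binary case: threshold the positive-class probability (assumed 0.5).
        return (y_predicted > threshold).astype(int)
    # Multi-column case: pick the most probable class index.
    return np.argmax(y_predicted, axis=1)


def negative_accuracy(y_true, y_predicted, sample_weight=None):
    """Accuracy is higher-is-better, so return its negative to minimize."""
    y_true = np.asarray(y_true).ravel()
    correct = (to_labels(y_predicted) == y_true).astype(float)
    # np.average falls back to the plain mean when weights is None.
    return -float(np.average(correct, weights=sample_weight))


y_true = np.array([0, 1, 1, 0])
good = np.array([0.1, 0.9, 0.8, 0.2])  # all four thresholded labels match
poor = np.array([0.9, 0.1, 0.8, 0.2])  # only the last two match
print(negative_accuracy(y_true, good), negative_accuracy(y_true, poor))
```

The better predictions produce the lower (more negative) score, which is the ordering a minimizing model selection needs.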
diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml
@@ -72,6 +72,7 @@ nav:
   - Get started: index.md
   - Features: 
     - Apps: features/apps.md
+    - Custom eval metric: features/custom-eval-metric.md
     - Preprocessing: features/preprocessing.md
     - Save and Load models: features/save-and-load-models.md
     - Steps of AutoML: features/automl.md
diff --git a/supervised/automl.py b/supervised/automl.py
@@ -150,12 +150,21 @@ def __init__(
 
             stack_models (boolean): Whether a models stack gets created at the end of the training. Stack level is 1.
 
-            eval_metric (str): The metric to be used in early stopping and to compare models.
+            eval_metric (str or function): The metric to be used in early stopping and to compare models.
 
                 - for binary classification: `logloss`, `auc`, `f1`, `average_precision`, `accuracy` - default is `logloss` (if left "auto")
                 - for multiclass classification: `logloss`, `f1`, `accuracy` - default is `logloss` (if left "auto")
                 - for regression: `rmse`, `mse`, `mae`, `r2`, `mape`, `spearman`, `pearson` - default is `rmse` (if left "auto")
 
+                You can also pass a custom Python function directly. The expected interface is:
+
+                `def my_metric(y_true, y_predicted, sample_weight=None): return score`
+
+                The returned value is always minimized. If you want to maximize a metric,
+                for example precision or F1, return its negative value. For classification
+                tasks, `y_predicted` can contain probabilities, so thresholding or `argmax`
+                might be needed inside the custom metric.
+
             validation_strategy (dict): Dictionary with validation type. Right now train/test split and cross-validation are supported.
 
                 Example:
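Reviewer note, not part of the commit: the docstring and the Notes section ask that a custom metric return a single numeric value, accept `sample_weight=None`, and be deterministic. A pre-flight check along the following lines could catch violations before a long training run. `check_custom_metric` is a hypothetical helper, not an `mljar-supervised` API, and `custom_mae` exists only to exercise it.

```python
import math
import numbers

import numpy as np


def check_custom_metric(metric, y_true, y_predicted):
    """Hypothetical pre-flight check for a user-defined eval metric."""
    weights = np.ones(len(y_true))
    first = metric(y_true, y_predicted)
    second = metric(y_true, y_predicted, sample_weight=None)
    weighted = metric(y_true, y_predicted, sample_weight=weights)
    for value in (first, second, weighted):
        # The metric must return one finite real number, not an array.
        if not isinstance(value, numbers.Real) or not math.isfinite(value):
            raise ValueError(f"metric returned a non-scalar or non-finite value: {value!r}")
    # Two calls on identical inputs should agree exactly.
    if first != second:
        raise ValueError("metric is not deterministic on identical inputs")
    return first


def custom_mae(y_true, y_predicted, sample_weight=None):
    """Lower is better, so the value is returned directly."""
    error = np.abs(np.asarray(y_true) - np.asarray(y_predicted))
    return float(np.average(error, weights=sample_weight))


score = check_custom_metric(custom_mae, [1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(score)
```

The check runs on a small sample, so it costs almost nothing compared with an AutoML run that fails at the first evaluation.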