# MLRun Hub

A centralized repository for open-source MLRun functions, modules, and steps that can be used as reusable components in ML pipelines.

Before you begin, ensure you have the following installed:

- **Python 3.10 or 3.11 (recommended)** - Required
- **UV** - Fast Python package manager (required)
- **Git** - For version control
- **Make** (optional) - For convenient command shortcuts
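
As a quick sanity check, you can verify the prerequisites from Python. This is a sketch, not part of the repository; the tool names are taken from the list above:

```python
import shutil
import sys


def check_prerequisites() -> dict:
    """Report whether the prerequisites listed above are available."""
    return {
        # Python 3.10 or 3.11 is required
        "python_ok": sys.version_info[:2] in {(3, 10), (3, 11)},
        "uv": shutil.which("uv") is not None,
        "git": shutil.which("git") is not None,
        "make": shutil.which("make") is not None,  # optional
    }


for name, ok in check_prerequisites().items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```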
We follow **PEP 8** style guidelines with some modifications:

- **Line length**: 88 characters (Black default)
- **Imports**: Sorted with isort
- **Type hints**: Encouraged for function signatures

### Formatting Tools
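The Black and isort checks from this section can also be scripted, e.g. as a pre-commit step. This is a minimal sketch, not part of the repository; it assumes both tools are installed in the active environment and returns `None` for a missing tool:

```python
import shutil
import subprocess


def run_format_checks(paths: tuple = (".",)) -> dict:
    """Run Black and isort in check-only mode; True = clean, None = tool missing."""
    commands = {
        "black": ["black", "--check", *paths],
        "isort": ["isort", "--check-only", *paths],
    }
    results = {}
    for tool, cmd in commands.items():
        if shutil.which(tool) is None:
            results[tool] = None  # tool not installed in this environment
        else:
            results[tool] = subprocess.run(cmd, capture_output=True).returncode == 0
    return results


print(run_format_checks())
```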
### Documentation Standards

- **Docstrings are mandatory** for all public hub items
- Use clear, concise descriptions
- Include parameter types and return types
- Provide usage examples when helpful

**Example (function `auto_trainer`):**
```python
from typing import List, Optional, Union

from mlrun.datastore import DataItem
from mlrun.execution import MLClientCtx


def train(
    context: MLClientCtx,
    dataset: DataItem,
    model_class: str,
    label_columns: Optional[Union[str, List[str]]] = None,
    drop_columns: List[str] = None,
    model_name: str = "model",
    tag: str = "",
    sample_set: DataItem = None,
    test_set: DataItem = None,
    train_test_split_size: float = None,
    random_state: int = None,
    labels: dict = None,
    **kwargs,
):
    """
    Train a model with the given dataset.

    example::

        import mlrun
        project = mlrun.get_or_create_project("my-project")
        project.set_function("hub://auto_trainer", "train")
        trainer_run = project.run(
            name="train",
            handler="train",
            inputs={"dataset": "./path/to/dataset.csv"},
            params={
                "model_class": "sklearn.linear_model.LogisticRegression",
                "label_columns": "label",
                "drop_columns": "id",
                "model_name": "my-model",
                "tag": "v1.0.0",
                "sample_set": "./path/to/sample_set.csv",
                "test_set": "./path/to/test_set.csv",
                "CLASS_solver": "liblinear",
            },
        )

    :param context: MLRun context.
    :param dataset: The dataset to train the model on. Can be either a URI or a FeatureVector.
    :param model_class: The class of the model, e.g. `sklearn.linear_model.LogisticRegression`.
    :param label_columns: The target label(s) of the column(s) in the dataset, for regression or
        classification tasks. Mandatory when the dataset is not a FeatureVector.
    :param drop_columns: A string or list of strings naming the columns to drop.
    :param model_name: The name to use when storing the model artifact. Defaults to 'model'.
    :param tag: The tag to log the model with.
    :param sample_set: A sample set of model inputs, logged alongside the model to support
        model monitoring. Can be either a URI or a FeatureVector.
    :param test_set: The test set to evaluate the model with.
    :param train_test_split_size: Ignored if test_set is provided. Should be between 0.0 and 1.0
        and represents the proportion of the dataset to include in the test split. The size of
        the training set is the complement of this value. Default = 0.2.
    :param random_state: Relevant only when using train_test_split_size. A random state seed used
        to shuffle the data; only integer values are accepted. For more information, see:
        https://scikit-learn.org/stable/glossary.html#term-random_state
    :param labels: Labels to log with the model.
    :param kwargs: Keyword arguments with the following prefixes are parsed and passed on to the
        relevant function:
        - `CLASS_` - arguments for the model class
        - `FIT_` - arguments for the `fit` function
        - `TRAIN_` - arguments for the `train` function (in the xgb or lgbm train function - future)
    """
    # Implementation here
```
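
The `CLASS_`/`FIT_`/`TRAIN_` prefix convention described in the docstring above can be illustrated with a small sketch. `split_prefixed_kwargs` is a hypothetical helper for illustration, not the hub's actual parser:

```python
def split_prefixed_kwargs(kwargs: dict) -> dict:
    """Group keyword arguments by prefix (hypothetical illustration)."""
    groups = {"CLASS_": {}, "FIT_": {}, "TRAIN_": {}}
    for key, value in kwargs.items():
        for prefix, group in groups.items():
            if key.startswith(prefix):
                # Strip the prefix before forwarding to the target callable
                group[key[len(prefix):]] = value
                break
    return groups


groups = split_prefixed_kwargs({"CLASS_solver": "liblinear", "FIT_sample_weight": None})
print(groups["CLASS_"])  # {'solver': 'liblinear'}
```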
```bash
# Verify your location
pwd  # Should be the project root

# Ensure dependencies are installed
make sync
```

#### Tests Failing

**Solution:**
```bash
# Install test dependencies if the item has a requirements.txt
cd functions/src/your_function
uv pip install -r requirements.txt
```
0 commit comments