A package to streamline common code chunks executed by students in the UBC MDS program circa 2025.
- `compare_model_scores()` - a function that takes multiple models and returns a table of mean cross-validation (CV) scores for each model, for easy comparison.
- `perform_eda()` - a function to perform exploratory data analysis on a dataset.
- `dataset_summary()` - a function that generates a comprehensive summary of a dataset, including missing value statistics, feature counts, duplicate rows, and descriptive statistics.
- `htv()` - (Hypothesis Test Visualization) a function that produces clear plots of a hypothesis test result, making it easier to understand what happened in the test than raw numbers alone.
- While this package extends cross-validation from scikit-learn, there are no known packages that provide CV score comparison similar to `compare_model_scores()`. The most similar is the `summary_cv()` function in the CrossPy package, which summarizes CV scores for a single model.
- While the `ProfileReport` class from the ydata-profiling package provides automated exploratory data analysis and reporting, there are no known packages that offer the same level of flexible, on-demand visualizations and insights as the `perform_eda()` function. The most similar functionality is available in pandas-profiling, which generates detailed HTML reports but lacks the modular, interactive approach that `perform_eda()` provides for tailoring EDA to specific datasets and workflows.
- The `dataset_summary()` function combines essential dataset insights (missing values, feature types, duplicates, and basic statistics) into one comprehensive and easy-to-use tool. While similar functionality exists in libraries like pandas-profiling and missingno, these tools focus on specific aspects or full-scale exploratory analysis. No single function consolidates all these features in one place, making `dataset_summary()` a uniquely efficient solution for preprocessing workflows.
- No known package provides plots for hypothesis test output. Data scientists typically build such visualizations manually, which is not friendly for learners.
## Installation

```bash
$ pip install mds_2025_helper_functions
```
# Function Documentation and Usage
## 1. `compare_model_scores`
### Description:
This function compares the mean cross-validation scores of multiple ML models and produces a summary table.
### Parameters:
- `*args` (BaseEstimator): Models to evaluate (e.g., `LogisticRegression`, `RandomForestClassifier`, etc.).
- `X` (array-like): Training dataset of features with shape `(n_samples, n_features)`.
- `y` (array-like, optional): Target values for supervised learning tasks.
- `scoring` (string or callable, optional): Evaluation metrics (e.g., `"accuracy"`). Refer to the [Scikit-learn scoring documentation](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter).
- `return_train_scores` (bool): Whether to include training scores in addition to test scores. Default is `False`.
- `**kwargs`: Additional arguments for `sklearn.model_selection.cross_validate`.
### Returns:
A `pandas.DataFrame` comparing the performance of the models.
### Example Usage:
```python
from mds_2025_helper_functions.scores import compare_model_scores
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
# Load sample dataset
data = load_iris()
X, y = data["data"], data["target"]
# Compare models
compare_model_scores(
LogisticRegression(),
DecisionTreeClassifier(),
X=X,
y=y,
scoring="accuracy"
)
```
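If you also want training scores or a different number of folds, the parameter list above says that `return_train_scores` adds training scores and that extra keyword arguments are forwarded to `sklearn.model_selection.cross_validate`. The sketch below assumes that behavior; the exact column names of the returned DataFrame depend on the implementation.

```python
from mds_2025_helper_functions.scores import compare_model_scores
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Sketch: return_train_scores and cv follow the parameter list above;
# cv=10 is assumed to be forwarded to sklearn's cross_validate.
results = compare_model_scores(
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(),
    X=X,
    y=y,
    scoring="accuracy",
    return_train_scores=True,
    cv=10,
)
print(results)
```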
## 2. `perform_eda`
### Description:
A one-stop Exploratory Data Analysis (EDA) function to generate data summaries, spot missing values, visualize feature distributions, and detect outliers.
### Parameters:
- `dataframe` (pd.DataFrame): The input dataset for analysis.
- `rows` (int): Number of rows in the grid layout for visualizations. Default is `5`.
- `cols` (int): Number of columns in the grid layout for visualizations. Default is `2`.
### Returns:
- Prints dataset statistics, a missing values report, and an outlier summary.
- Generates plots and visualizations using Matplotlib and Seaborn.
### Example Usage:
```python
from mds_2025_helper_functions.eda import perform_eda
import pandas as pd
data = {
'Age': [25, 32, 47, 51, 62],
'Salary': [50000, 60000, 120000, 90000, 85000],
'Department': ['HR', 'Finance', 'IT', 'Finance', 'HR'],
}
df = pd.DataFrame(data)
perform_eda(df, rows=2, cols=2)
```
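For a wider dataset, the grid can be enlarged via `rows` and `cols`. The snippet below is a hypothetical usage on the scikit-learn diabetes data (also referenced in the required imports further down), assuming `perform_eda()` accepts any pandas DataFrame:

```python
from mds_2025_helper_functions.eda import perform_eda
from sklearn.datasets import load_diabetes

# Load the diabetes data as a DataFrame (10 numeric features plus the target).
diabetes = load_diabetes(as_frame=True)
df_diabetes = diabetes.frame

# Assumed usage: a 4 x 3 grid, enough cells if the function draws one plot per column.
perform_eda(df_diabetes, rows=4, cols=3)
```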
## 3. `dataset_summary`
### Description:
Generates a summary of a dataset including missing values, feature types, duplicate rows, and other descriptive statistics.
### Parameters:
- `data` (pd.DataFrame): The dataset to summarize.
### Returns:
A dictionary containing:
- `'missing_values'`: DataFrame of missing value counts and percentages.
- `'feature_types'`: Counts of numerical and categorical features.
- `'duplicates'`: Number of duplicate rows.
- `'numerical_summary'`: Descriptive statistics for numerical columns.
- `'categorical_summary'`: Unique value counts for categorical features.
### Example Usage:
```python
from mds_2025_helper_functions.dataset_summary import dataset_summary
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Alice', None],
'Age': [25, 32, 47, None, 29],
'Salary': [50000, 60000, 120000, None, 80000],
'Department': ['HR', 'Finance', 'IT', 'HR', 'Finance']
}
df = pd.DataFrame(data)
summary = dataset_summary(df)
print(summary['missing_values'])
print(summary['numerical_summary'])
print(summary['categorical_summary'])
```
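The remaining keys listed under Returns can be used the same way, for example to drive simple preprocessing decisions. A minimal sketch continuing the example above, assuming `'duplicates'` is an integer count as described:

```python
# Key names are taken from the Returns list above.
print(summary['feature_types'])   # counts of numerical vs. categorical features

# Act on the summary (assumes 'duplicates' is an integer count of duplicate rows).
if summary['duplicates'] > 0:
    df = df.drop_duplicates()
```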
## 4. `htv`
### Description:
Visualizes Type I (α) and Type II (β) errors in hypothesis tests.
### Parameters:
- `test_output` (dict): Dictionary containing hypothesis test parameters:
  - `'mu0'` (float): Mean under the null hypothesis (H₀).
  - `'mu1'` (float): Mean under the alternative hypothesis (H₁).
  - `'sigma'` (float): Standard deviation.
  - `'sample_size'` (int): Sample size.
  - `'df'` (int, optional): Degrees of freedom, required for `'t'` or `'chi2'` tests.
  - `'df1'`, `'df2'` (int, optional): Degrees of freedom for F-tests (`'anova'`).
- `test_type` (str): Type of test (`'z'`, `'t'`, `'chi2'`, or `'anova'`).
- `alpha` (float): Significance level for Type I error. Default is `0.05`.
- `tail` (str): `'one-tailed'` or `'two-tailed'`. Default is `'two-tailed'`.
### Returns:
- A tuple of `(fig, ax)` for plotting the visualization.
### Example Usage:
```python
from mds_2025_helper_functions.htv import htv
import matplotlib.pyplot as plt
test_params = {
'mu0': 100,
'mu1': 105,
'sigma': 15,
'sample_size': 30
}
fig, ax = htv(test_params, test_type="z", alpha=0.05, tail="two-tailed")
plt.show()

test_params_t = {
'mu0': 0,
'mu1': 1.5,
'sigma': 1,
'sample_size': 25
}
fig, ax = htv(test_params_t, test_type="t", alpha=0.01, tail="one-tailed")
plt.show()
```
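Because the parameter list states that `'df'` is required for `'t'` and `'chi2'` tests, the sketch below passes it explicitly for the t-test case. The value `df = sample_size - 1` is an assumed, conventional choice; the example above omits it, so the function may also derive it internally.

```python
from mds_2025_helper_functions.htv import htv
import matplotlib.pyplot as plt

# 'df' is passed explicitly here, per the parameter list for 't' tests.
# df = sample_size - 1 is an assumption, not taken from the package docs.
test_params_t_df = {
    'mu0': 0,
    'mu1': 1.5,
    'sigma': 1,
    'sample_size': 25,
    'df': 24
}
fig, ax = htv(test_params_t_df, test_type="t", alpha=0.01, tail="one-tailed")
plt.show()
```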
## Running Tests

Required imports:

```python
from mds_2025_helper_functions.scores import compare_model_scores
from mds_2025_helper_functions.eda import perform_eda
from mds_2025_helper_functions.dataset_summary import dataset_summary
from mds_2025_helper_functions.htv import htv
from sklearn.datasets import load_iris, load_diabetes
from sklearn.dummy import DummyRegressor, DummyClassifier
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
import matplotlib.pyplot as plt
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
```

Run the full test suite:

```bash
pytest
```

Run a single test file:

```bash
pytest test_dataset_summary.py
pytest test_eda.py
pytest test_htv.py
pytest test_scores.py
```

Run a single test function:

```bash
pytest test_dataset_summary.py::test_function_name
```

Run tests with verbose output:

```bash
pytest -v
```

Measure test coverage:

```bash
pytest --cov=.
```

Generate an HTML coverage report:

```bash
pytest --cov=. --cov-report=html
```

Run tests with 4 parallel workers:

```bash
pytest -n 4
```

Clear the pytest cache:

```bash
pytest --cache-clear
```

## Contributing

Interested in contributing? Check out the contributing guidelines. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.
Karlygash Zhakupbayeva, Samuel Adetsi, Xi Cu, Michael Hewlett
mds_2025_helper_functions was created by Karlygash Zhakupbayeva, Samuel Adetsi, Xi Cu, Michael Hewlett. It is licensed under the terms of the MIT license.
mds_2025_helper_functions was created with cookiecutter and the py-pkgs-cookiecutter template.