Missingness Analyzer



Canada, Vancouver

A data science package created for DSCI 524 (Collaborative Software Development), a course in the Master of Data Science program at the University of British Columbia, 2025-2026.

Contributors

Rocco Lee, Nguyen Nguyen, Shuhang Li

About

Handling missing data is one of the most common data-cleaning tasks in any analysis project. Large amounts of missing data can heavily skew the distribution of features or labels in a dataset, or invalidate large portions of its rows if no imputation strategy is defined. The vision for this package is to go beyond a surface-level summary of how much data is missing: it also aims to identify the pattern behind the missingness, namely Missing Completely At Random (MCAR), Missing At Random (MAR), or Missing Not At Random (MNAR), and to use machine learning algorithms to suggest an imputation strategy suited to the context.
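To make the missingness categories concrete, here is a minimal sketch that simulates MCAR and MAR on toy data using only NumPy and pandas. It is not part of the package: the column names, probabilities, and seed are assumptions chosen for illustration. Under MCAR the rows with a missing value look like every other row, while under MAR they differ systematically on an observed column (here, age). MNAR, where missingness depends on the unobserved value itself, cannot be detected this way.

```python
import numpy as np
import pandas as pd

# Illustrative values only; none of these names come from the package.
rng = np.random.default_rng(seed=524)
n = 10_000

# Fully observed toy data: income rises with age.
age = rng.integers(18, 80, size=n).astype(float)
income = 20_000 + 800 * age + rng.normal(0, 5_000, size=n)

# MCAR: every income value has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.20

# MAR: the chance of a missing income depends on the observed age column.
mar_mask = rng.random(n) < np.where(age > 50, 0.35, 0.05)

df = pd.DataFrame({"age": age, "income_mcar": income, "income_mar": income})
df.loc[mcar_mask, "income_mcar"] = np.nan
df.loc[mar_mask, "income_mar"] = np.nan


def age_gap(frame, column):
    """Mean age of rows missing `column` minus mean age of rows that have it."""
    missing = frame[column].isna()
    return frame.loc[missing, "age"].mean() - frame.loc[~missing, "age"].mean()


age_gap_mcar = age_gap(df, "income_mcar")  # close to zero
age_gap_mar = age_gap(df, "income_mar")    # clearly positive
print(f"MCAR age gap: {age_gap_mcar:.2f}")
print(f"MAR age gap:  {age_gap_mar:.2f}")
```

A gap near zero is consistent with MCAR. A large gap shows that missingness is related to an observed variable, which rules out MCAR for that column.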

Link to package: https://test.pypi.org/project/missingness-analyzer/

Setting Up

Here's how to set up missingness_analyzer for local development:

Fork the repository: https://github.com/UBC-MDS/DSCI-524-Group_19_Missingness_Analyzer
Clone the fork locally using:

git clone git@github.com:UBC-MDS/DSCI-524-Group_19_Missingness_Analyzer.git

Then cd into the root of the repo:

cd DSCI-524-Group_19_Missingness_Analyzer

Create the virtual environment with:

conda env create -f environment.yml

Once the environment is created, activate it with:

conda activate 524-Group-19

Install the package with:

pip install -i https://test.pypi.org/simple/ missingness-analyzer

Develop Away!

Make sure to document your changes with comments.
If you add new functions in new Python files, make sure their docstrings follow the NumPy docstring format.
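For reference, a NumPy-style docstring might look like the sketch below. The function `count_missing` is a hypothetical helper written for this example and is not part of the package. The section headings (Parameters, Returns, Raises, Examples) follow the numpydoc convention.

```python
import pandas as pd


def count_missing(df):
    """
    Count the missing values in each column of a data frame.

    Hypothetical helper used only to illustrate the docstring layout.

    Parameters
    ----------
    df : pandas.DataFrame
        The data frame to inspect.

    Returns
    -------
    pandas.Series
        Number of missing values per column, indexed by column name.

    Raises
    ------
    TypeError
        If `df` is not a pandas DataFrame.

    Examples
    --------
    >>> import numpy as np
    >>> df = pd.DataFrame({"age": [25, np.nan, 35]})
    >>> count_missing(df)
    age    1
    dtype: int64
    """
    if not isinstance(df, pd.DataFrame):
        raise TypeError("df must be a pandas DataFrame")
    return df.isna().sum()
```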

Publishing Your Code

After fixing bugs or developing new features, here's how to deploy your changes:

Verify that all tests still pass with (run in terminal):

pytest

Once all tests pass, commit and push your changes to the remote repository and create a pull request.
This should automatically trigger a GitHub workflow that updates the HTML documentation site for this package, builds an artifact, and deploys the changes to PyPI.

The updated documentation can be found here
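If your change adds behaviour, it helps to add a test that the pytest step above will pick up. The sketch below shows the general shape of such a test file. The `missing_fraction` helper is hypothetical and defined inline so the example is self-contained. In the repository, a test would import the function under test from the package instead.

```python
import numpy as np
import pandas as pd
import pytest


def missing_fraction(df):
    """Hypothetical helper: fraction of missing values per column."""
    if not isinstance(df, pd.DataFrame):
        raise TypeError("df must be a pandas DataFrame")
    return df.isna().mean()


def test_missing_fraction_counts_nan():
    # One of four age values is missing; income is fully observed.
    df = pd.DataFrame(
        {"age": [25, np.nan, 35, 40], "income": [1.0, 2.0, 3.0, 4.0]}
    )
    result = missing_fraction(df)
    assert result["age"] == pytest.approx(0.25)
    assert result["income"] == 0.0


def test_missing_fraction_rejects_non_dataframe():
    # Invalid input should fail loudly rather than return a wrong answer.
    with pytest.raises(TypeError):
        missing_fraction([1, 2, 3])
```

pytest collects any function whose name starts with `test_` in files named `test_*.py`, so a file like this placed under the tests directory runs automatically.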

List of Functions

missing_how_type
- This function describes the amount of missing data in the dataset and attempts to identify the type of missingness (MCAR, MAR or MNAR). It will return
suggest_imputation
- This function takes the dataset and the type of missingness, then examines the amount of missingness and the data types in the dataframe to suggest the best-suited imputation strategy. The suggested method and the reasoning behind it are returned to the user as a dictionary.
missing_correlation_matrix
- This function takes a pandas dataframe as an argument and returns a correlation matrix of missingness between columns to help identify the type of missingness.
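As a rough illustration of the idea behind a missingness correlation matrix, the sketch below converts each column into a 0/1 missing indicator and correlates the indicators with pandas. This is an assumption about the general technique, not the package's actual implementation. The function name `missingness_correlation` is hypothetical.

```python
import numpy as np
import pandas as pd


def missingness_correlation(df):
    """Correlate per-column missing indicators (hypothetical sketch)."""
    # 1 where a value is missing, 0 where it is present.
    indicators = df.isna().astype(int)
    # Drop columns whose indicator is constant (nothing or everything
    # missing), since their correlation is undefined.
    varying = indicators.loc[:, indicators.nunique() > 1]
    return varying.corr()


df = pd.DataFrame(
    {"age": [25, np.nan, 35], "income": [50000, 60000, np.nan]}
)
corr = missingness_correlation(df)
print(corr)
```

For this three-row frame the age/income entry is -0.5: the two columns are never missing in the same row. Values near +1 would mean two columns tend to be missing together, which hints that the missingness is not completely at random.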

Usage

from missingness_analyzer.type_of_missing_and_how import missing_how_type
from missingness_analyzer.missing_correlation_matrix import missing_correlation_matrix
from missingness_analyzer.suggest_imputation import suggest_imputation
import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [25, np.nan, 35], 'income': [50000, 60000, np.nan]})

# suggest_imputation
results = suggest_imputation(df)
print(results['method'])
# Output:
# KNNImputer (k=5)

# missing_correlation_matrix
missing_correlation_matrix(df)
# Output:
#             age     income
# age         1.0     -0.5
# income      -0.5    1.0

# missing_how_type
missing_how_type(df)
# Output:
# This data frame have 2 missing values, below is the number of missing values for each column:
# age        1
# income     1
# dtype: int64
#
# - Columns with True value is Missing Completely at Random (MCAR)
# - Columns with False value are either Missing at Random (MAR) or Missing Not at Random (MNAR)
# - Since MAR and MNAR cannot be tested statistically and formally, additional domain expertise is needed for further investigation
#
#         MCAR
# target
# age     True
# income  True

Dataset Acknowledgement

This project was developed using the following dataset:

Dataset name: Retail Product Dataset with Missing Values
Source: Kaggle
License: CC0 1.0 Universal (Public Domain)

Contributing

Please see CONTRIBUTING.md for guidelines on how to contribute to this package.

Code of Conduct

Please note that this project is released with a Code of Conduct. By participating in this project you agree to abide by its terms.

License

This project is licensed under the MIT License. Please see the LICENSE file for details.

Citation

If you use this package, please cite it as follows:

Lee, R., Nguyen, N., & Li, S. (2026). missingness-analyzer (Version 0.5.3).
https://test.pypi.org/project/missingness-analyzer/

Python Ecosystem

Below is a summary of existing packages related to our topic:

scikit-na (https://pypi.org/project/scikit-na/) This package provides functions for statistical analysis, visualization, and export that help data scientists understand and handle missing values in their datasets.

mdatagen (https://github.com/ArthurMangussi/pymdatagen) This GitHub repo contains a project for artificially generating missing data.

Other Existing Packages (e.g. deepchecks) Other packages like deepchecks have functions that can be used to write tests that detect whether the amount of missing data in a dataset exceeds a set threshold.

What differentiates our package from existing ones is a smart imputation function that suggests an imputation method based not only on the type of missingness present in the dataframe, but also on the data types of its columns. Also included are two helper functions: one that helps the user identify the type of missingness present in the input, and one that displays a correlation matrix of the missing data.
