Welcome to survey_cleaner
survey_cleaner streamlines the cleaning of survey data by automating common cleaning tasks. Designed to generalize across survey topics, it provides functions to remove duplicate responses, strip unnecessary whitespace, normalize responses to a binary format, and convert ordinal responses to numeric data. The package sets up a standardized cleaning framework that can be carried across projects, helping users reduce manual preprocessing time and minimize errors.
- remove_duplicates: keeps only the latest survey response from each individual.
- handle_emptyStrings: handles None, raises TypeError for non-string inputs, collapses all whitespace into single spaces, and strips leading/trailing whitespace.
- normalize_binary: converts binary responses such as True and False, T and F, or Yes and No to a binary format (0 and 1).
- word_to_ordinal: assigns ranking words such as Best, Better, Good, Bad, and Worst a numerical rating so that responses can be ordered by value. Likert scales are provided as default rankings, but users can also supply their own.

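As a rough illustration of the behaviour described above, here is a minimal standard-library sketch of the four cleaning steps. It is not the survey_cleaner implementation: the names `collapse_whitespace`, `to_binary`, `to_ordinal`, and `latest_per_respondent` are hypothetical, and details such as returning None unchanged, matching case-insensitively, and returning None for unmapped ranking words are assumptions made for this example.

```python
import re

# Hypothetical stand-ins written for illustration only; they are NOT the
# survey_cleaner functions and may differ from them in edge cases.


def collapse_whitespace(value):
    """Collapse runs of whitespace into single spaces and strip the ends."""
    if value is None:  # assumption: None is passed through unchanged
        return None
    if not isinstance(value, str):
        raise TypeError(f"expected str or None, got {type(value).__name__}")
    return re.sub(r"\s+", " ", value).strip()


# Assumed vocabulary, matched case-insensitively.
_BINARY = {"yes": 1, "true": 1, "t": 1, "no": 0, "false": 0, "f": 0}


def to_binary(value):
    """Map a yes/no style answer to 1 or 0 (KeyError for unknown answers)."""
    return _BINARY[collapse_whitespace(str(value)).lower()]


def to_ordinal(value, ranking):
    """Look up a ranking word; assumption: unmapped words become None."""
    return ranking.get(value.lower())


def latest_per_respondent(rows, id_key, time_key):
    """Keep, for each respondent id, the row with the greatest timestamp.

    Timestamps are compared as given, so 'YYYY-MM-DD HH:MM' strings work
    because they sort chronologically.
    """
    latest = {}
    for row in rows:
        kept = latest.get(row[id_key])
        if kept is None or row[time_key] > kept[time_key]:
            latest[row[id_key]] = row
    return list(latest.values())


ranking = {"worst": 1, "bad": 2, "good": 3, "better": 4, "best": 5}
rows = [
    {"respondent_id": 1, "completed_at": "2024-01-01 10:00", "answer": "Yes"},
    {"respondent_id": 2, "completed_at": "2024-01-01 11:00", "answer": "No"},
    {"respondent_id": 1, "completed_at": "2024-01-01 12:00", "answer": "Maybe"},
]

print(collapse_whitespace("  very \t good  "))   # very good
print(to_binary("Yes"), to_binary("F"))          # 1 0
print(to_ordinal("Best", ranking))               # 5
print([r["answer"] for r in latest_per_respondent(rows, "respondent_id", "completed_at")])
```

The package functions are applied to pandas objects rather than plain lists and dictionaries, so this sketch only conveys the intended transformations, not the actual API.
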
While several text-cleaning packages are available on PyPI, such as clean-text, which preprocesses raw text data from the web, none is dedicated specifically to cleaning survey response data. The survey_cleaner package addresses that gap.
You can install the latest release of survey_cleaner from TestPyPI using pip:

    $ pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ survey_cleaner

Clean up whitespace in free-text answers with handle_emptyStrings:

    from survey_cleaner import handle_emptyStrings
    import pandas as pd

    # Example data, added here so the snippet runs on its own
    df = pd.DataFrame({'comments': ['  great   service ', 'ok']})
    # Removes leading/trailing whitespace and collapses multiple spaces
    df['comments'] = df['comments'].apply(handle_emptyStrings)

Convert yes/no style answers with normalize_binary:

    from survey_cleaner import normalize_binary
    import pandas as pd

    # Converts Yes/No, True/False, T/F to 1/0
    df = pd.DataFrame({'response': ['Yes', 'No', 'Yes']})
    df['response'] = df['response'].apply(normalize_binary)

Map ranking words to numbers with word_to_ordinal:

    from survey_cleaner import word_to_ordinal
    import pandas as pd

    feedback = pd.Series(["strongly agree", "agree",
                          "neither agree nor disagree", "disagree"])
    # Custom mapping; warns about unmapped values
    word_to_ordinal(feedback, mapping={"strongly agree": 5, "Bad": 0})
    # Using the default Likert scale
    word_to_ordinal(feedback, likert="agreement")

Keep only the latest response per respondent with remove_duplicates:

    from survey_cleaner import remove_duplicates
    import pandas as pd

    responses = pd.DataFrame({
        'respondent_id': [1, 2, 1, 3],
        'completed_at': ['2024-01-01 10:00', '2024-01-01 11:00',
                         '2024-01-01 12:00', '2024-01-01 13:00'],
        'answer': ['Yes', 'No', 'Maybe', 'Yes']
    })
    clean_responses = remove_duplicates(responses, 'respondent_id', 'completed_at')

To set up a development environment, clone the repository to your local machine:

    $ git clone https://github.com/UBC-MDS/DSCI_524_group35_survey_cleaner.git
    $ cd DSCI_524_group35_survey_cleaner

It is recommended, but not required, to create a conda environment from the environment file:

    $ conda env create -f environment.yml
    $ conda activate survey_cleaner

Install the package in development mode:

    $ pip install -e ".[docs]"

Run the test suite:

    $ pytest tests/

Install the documentation dependencies:

    $ pip install -e ".[docs]"

Generate the API documentation using quartodoc:

    $ quartodoc build

Preview the documentation:

    $ quarto preview

Render the final HTML output:

    $ quarto render

| Workflow | Trigger | Purpose |
|---|---|---|
| build.yml | Push/PR to main | Runs tests and builds package |
| deploy.yml | Push to main (after tests pass) | Deploys package to TestPyPI |
| quartodoc.yml | Push/PR to main | Builds API documentation with quartodoc |
| quartodoc-publish.yml | Push to main | Publishes documentation to GitHub Pages |

The quartodoc.yml workflow automatically:
- Checks out the repository
- Sets up the Python environment
- Installs the package with documentation dependencies
- Runs quartodoc build to generate API docs
- Validates the documentation build


The quartodoc-publish.yml workflow automatically:
- Builds the documentation using Quarto
- Deploys to GitHub Pages when changes are pushed to main
- Makes the documentation available at the GitHub Pages URL

Once deployed, documentation is available at:
- GitHub Pages: https://ubc-mds.github.io/DSCI_524_group35_survey_cleaner/
- Netlify: https://dsci524group35surveycleaner.netlify.app/
Interested in contributing? Check out the contributing guidelines. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.
Natalie Truesdell, Amanpreet Binepal, Jay Li, Junli
- Copyright © 2026 Natalie Truesdell, Amanpreet Binepal, Jay Li, Junli.
- Free software distributed under the MIT License.